hub Canonical reference

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein · 2025 · cs.CV · arXiv 2505.09568

Canonical reference. 71% of citing Pith papers cite this work as background.

83 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 83 citing papers arXiv PDF

abstract

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models-first training on image understanding and subsequently on image generation-offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 24 baseline 4 dataset 2 other 1

citation-polarity summary

background 22 baseline 4 unclear 3 use dataset 2

claims ledger

abstract Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Ground

co-cited works

representative citing papers

Representation Fr\'echet Loss for Visual Generation

cs.CV · 2026-04-30 · unverdicted · novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

cs.CV · 2026-05-20 · conditional · novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

cs.MM · 2026-05-12 · unverdicted · novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.

AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.

Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

A masked discrete diffusion model adds token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.

Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

A self-evolving framework with proposer-solver-generator roles, Solver Token Entropy, and multi-scale internal evaluation improves unified LMMs on understanding and generation tasks using only self-derived consistency signals.

Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

Unison is a new benchmark with unified and decoupled tracks plus Unison-Judge to measure synergy between understanding and generation in multimodal models.

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

cs.CV · 2026-06-23 · conditional · novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

cs.CV · 2026-06-23 · unverdicted · novelty 6.0

IV-CoT introduces an implicit chain-of-thought framework that decomposes visual queries into a structural-to-semantic cascade with training-only sketch supervision to improve structure-aware text-to-image generation.

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

FG-BMK benchmark shows current LVLMs are inadequate at fine-grained recognition due to bottlenecks in visual representations, semantic grounding, modality alignment, and category knowledge.

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

UniAR uses a shared context-visual tokenizer with bitwise quantization and parallel prediction in an autoregressive framework to unify visual understanding and generation, claiming SOTA on generation and editing tasks.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

cs.CV · 2026-06-07 · unverdicted · novelty 6.0

CSFlow derives inference-time timestep weights for flow matching by matching per-step frequency content to human CSF, yielding 4.7% FID reduction and smaller gains on IS and GenEval.

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

cs.CV · 2026-05-31 · unverdicted · novelty 6.0

Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.

Representation Forcing for Bottleneck-Free Unified Multimodal Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.

GenClaw: Code-Driven Agentic Image Generation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

GenClaw introduces a three-stage code-driven workflow for agentic image generation that inserts programmatic sketches between linguistic reasoning and pixel synthesis.

Channel-wise Vector Quantization

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

CVQ replaces patch-wise vector quantization with channel-wise quantization of feature maps, enabling a next-channel autoregressive model that reports 100% codebook utilization and text-to-image scores of DPG 86.7 and GenEval 0.79.

citing papers explorer

Showing 18 of 18 citing papers after filters.

Do-Undo Bench: Reversibility for Action Understanding in Image Generation cs.CV · 2025-12-15 · unverdicted · none · ref 8 · internal anchor
Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model cs.CV · 2025-11-27 · unverdicted · none · ref 4 · internal anchor
AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
Mull-Tokens: Modality-Agnostic Latent Thinking cs.CV · 2025-12-11 · unverdicted · none · ref 8 · internal anchor
Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 5 · internal anchor
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 15 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models cs.CV · 2025-10-21 · unverdicted · none · ref 4 · internal anchor
VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning cs.CV · 2025-09-24 · unverdicted · none · ref 4 · internal anchor
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions cs.RO · 2025-09-08 · unverdicted · none · ref 9 · internal anchor
F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
Semantic-guided Gaussian Splatting for High-Fidelity Underwater Scene Reconstruction cs.CV · 2025-08-31 · conditional · none · ref 1 · internal anchor
SWAGSplatting augments 3D Gaussian Splatting with CLIP-derived semantic features, a semantic consistency loss, and adaptive primitive reallocation to achieve higher-fidelity reconstruction in low-visibility underwater scenes.
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model cs.LG · 2025-05-29 · unverdicted · none · ref 4 · internal anchor
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing cs.CV · 2025-09-02 · unverdicted · none · ref 2 · internal anchor
Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning cs.CV · 2025-08-28 · unverdicted · none · ref 3 · internal anchor
Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.
OmniGen2: Towards Instruction-Aligned Multimodal Generation cs.CV · 2025-06-23 · unverdicted · none · ref 7 · internal anchor
OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation cs.CV · 2025-06-03 · unverdicted · none · ref 4 · internal anchor
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
Qwen-Image Technical Report cs.CV · 2025-08-04 · unverdicted · none · ref 6
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 18 · internal anchor
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation cs.CV · 2025-03-10 · unreviewed · ref 2 · internal anchor
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unreviewed · ref 11

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer