hub Baseline reference

Ovis-u1 technical report

Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, et al · 2025 · arXiv 2506.23044

Baseline reference. 67% of citing Pith papers use this work as a benchmark or comparison.

12 Pith papers citing it

Baseline 67% of classified citations

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 4 background 2

citation-polarity summary

baseline 4 background 2

representative citing papers

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

cs.MM · 2026-05-12 · unverdicted · novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

cs.CV · 2026-02-06 · unverdicted · novelty 7.0

PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

cs.CV · 2026-05-31 · unverdicted · novelty 6.0

Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

Training-Free Image Editing with Visual Context Integration and Concept Alignment

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

VicoEdit performs training-free image editing by transforming source images directly with visual context and concept-alignment-guided posterior sampling, outperforming training-based methods.

InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

InfoTok uses mutual information constraints to regularize shared visual tokenization in unified MLLMs, improving both understanding and generation performance without extra training data.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.

OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

cs.CV · 2026-02-05 · unverdicted · novelty 4.0

OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

cs.CV · 2026-04-19 · 2 refs

citing papers explorer

Showing 12 of 12 citing papers.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning cs.MM · 2026-05-12 · unverdicted · none · ref 19
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models cs.CV · 2026-04-13 · unverdicted · none · ref 62
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks cs.CV · 2026-02-06 · unverdicted · none · ref 41
PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.
ProductWebGen: Benchmarking Multimodal Product Webpage Generation cs.CV · 2026-05-31 · unverdicted · none · ref 29
Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 112 · 2 links
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation cs.CV · 2026-04-27 · unverdicted · none · ref 41 · 2 links
Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
Training-Free Image Editing with Visual Context Integration and Concept Alignment cs.CV · 2026-04-06 · unverdicted · none · ref 9
VicoEdit performs training-free image editing by transforming source images directly with visual context and concept-alignment-guided posterior sampling, outperforming training-based methods.
InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs cs.LG · 2026-02-02 · unverdicted · none · ref 60
InfoTok uses mutual information constraints to regularize shared visual tokenization in unified MLLMs, improving both understanding and generation performance without extra training data.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 130
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement cs.CV · 2026-04-20 · unverdicted · none · ref 32
UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization cs.CV · 2026-02-05 · unverdicted · none · ref 79
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unreviewed · ref 68 · 2 links

Ovis-u1 technical report

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer