Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou · 2025 · cs.CV · arXiv 2506.15564

Mixed citation behavior. The most common role is background (54% of classified citations).

Cited by 54 Pith papers.

abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

citation-role summary

background 13 · baseline 8 · method 2 · other 1

citation-polarity summary

background 13 · baseline 8 · use method 2 · unclear 1

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Introduces E2V-Bench benchmark for equation-to-visual generation in early arithmetic education, shows T2I models fail on numerical accuracy and relations, and reports partial gains from benchmark-guided enhancements.

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

MotionMERGE proposes a multi-granular LLM framework for fine-grained text-driven human motion editing, reasoning, generation, and explanation, supported by the new MotionFineEdit dataset with spatio-temporal annotations.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

cs.MM · 2026-05-12 · unverdicted · novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

cs.SD · 2026-04-22 · unverdicted · novelty 7.0

Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

cs.CV · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

Latent Visual Reasoning

cs.CV · 2025-09-29 · unverdicted · novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.

Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

A masked discrete diffusion model adds token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

IPT supervision improves spatial reasoning in VLMs on perspective taking, path tracing, and multiview counting tasks, often outperforming textual chain-of-thought while remaining consistent with observed inputs.

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

cs.CV · 2026-05-31 · unverdicted · novelty 6.0

Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Ptah is a multi-agent harness for generating verifiable interleaved multimodal reports via specialized agents and a verifier, with PtahEval for augmented assessment, showing better performance than baselines on deep research tasks.

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

citing papers explorer

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · none · ref 47 · internal anchor

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · none · ref 38 · internal anchor

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · none · ref 36 · internal anchor

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · none · ref 70 · internal anchor

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · none · ref 43 · internal anchor

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

cs.CV · 2026-05-11 · unverdicted · none · ref 12 · internal anchor

Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

cs.CV · 2026-04-27 · unverdicted · none · ref 67 · internal anchor

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

cs.CV · 2026-05-07 · unverdicted · none · ref 66 · internal anchor

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

Motus: A Unified Latent Action World Model

cs.CV · 2025-12-15 · unverdicted · none · ref 49 · internal anchor

Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · none · ref 44 · internal anchor

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · none · ref 152 · internal anchor

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
