hub Mixed citations

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song · 2024 · cs.CV · arXiv 2404.14396

Mixed citation behavior. Most common role is background (56%).

39 Pith papers citing it

Background 56% of classified citations

open full Pith review browse 39 citing papers arXiv PDF

abstract

The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets are released in https://github.com/AILab-CVC/SEED-X.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 baseline 6 dataset 2

citation-polarity summary

background 10 baseline 6 use dataset 2

representative citing papers

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

A Unified and Controllable Framework for Layered Image Generation with Visual Effects

cs.CV · 2026-01-21 · unverdicted · novelty 7.0

LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

cs.CV · 2026-01-04 · unverdicted · novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.

Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

cs.LG · 2025-09-26 · conditional · novelty 7.0

Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.

Transfer between Modalities with MetaQueries

cs.CV · 2025-04-08 · unverdicted · novelty 7.0

MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

cs.CV · 2025-03-18 · unverdicted · novelty 7.0

DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

Latent Action Control for Reasoning-Guided Unified Image Generation

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

Multimodal Large Language Models for Multi-Subject In-Context Image Generation

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods on a new MSIC benchmark.

PhotoFramer: Multi-modal Image Composition Instruction

cs.CV · 2025-11-30 · conditional · novelty 6.0

PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.

Emu3.5: Native Multimodal Models are World Learners

cs.CV · 2025-10-30 · unverdicted · novelty 6.0

Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

cs.LG · 2025-05-29 · unverdicted · novelty 6.0

Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.

MMaDA: Multimodal Large Diffusion Language Models

cs.CV · 2025-05-21 · unverdicted · novelty 6.0

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

cs.CL · 2025-03-10 · unverdicted · novelty 6.0

A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.

Emu3: Next-Token Prediction is All You Need

cs.CV · 2024-09-27 · unverdicted · novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling cs.AI · 2025-01-29 · conditional · none · ref 13 · internal anchor
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer