Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion

Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu · 2026 · arXiv 2603.06577

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ELF: Embedded Language Flows

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

Omnimodal Dataset Distillation via High-order Proxy Alignment

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

HoPA captures high-order cross-modal alignments via a shared proxy to enable scalable omnimodal dataset distillation with better performance-compression trade-offs.

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.

citing papers explorer

ELF: Embedded Language Flows · cs.CL · 2026-05-11 · unverdicted · none · ref 31
Omnimodal Dataset Distillation via High-order Proxy Alignment · cs.CV · 2026-04-12 · unverdicted · none · ref 65
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator · cs.CV · 2026-04-09 · unverdicted · none · ref 45
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset · cs.CV · 2026-05-20 · unverdicted · none · ref 56
