hub Canonical reference

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu · 2025 · cs.LG · arXiv 2505.16933

Canonical reference. 78% of citing Pith papers cite this work as background.

22 Pith papers citing it

Background 78% of classified citations

open full Pith review browse 22 citing papers arXiv PDF

abstract

In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 1 method 1

citation-polarity summary

background 7 baseline 1 use method 1

representative citing papers

Relative Score Policy Optimization for Diffusion Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

cs.CV · 2025-12-22 · conditional · novelty 7.0

dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

cs.CL · 2025-05-28 · conditional · novelty 7.0

Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.

dMoE: dLLMs with Learnable Block Experts

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

AnyMo is a masked-modeling framework for any-modality human motion generation trained on the new OmniHuMo dataset of 5,000+ hours of multimodal motion sequences.

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

cs.CV · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

Diagnoses mask prior drift and positional attention collapse in LDVLMs and introduces two plug-and-play decoding interventions that raise long-form generation quality without retraining.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

Continuous Latent Diffusion Language Model

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Stability-Weighted Decoding for Diffusion Language Models

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.

Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation

cs.SD · 2026-04-13 · unverdicted · novelty 6.0

LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

cs.CL · 2025-12-16 · unverdicted · novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

cs.LG · 2026-04-10 · unverdicted · novelty 5.0 · 2 refs

ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

cs.CL · 2026-05-22

ELF: Embedded Language Flows

cs.CL · 2026-05-11

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

cs.CL · 2025-10-07

citing papers explorer

Showing 22 of 22 citing papers.

Relative Score Policy Optimization for Diffusion Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 99 · internal anchor
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization cs.CV · 2026-05-08 · unverdicted · none · ref 5 · 2 links · internal anchor
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV · 2026-04-15 · unverdicted · none · ref 26 · internal anchor
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 94 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models cs.CV · 2025-12-22 · conditional · none · ref 37 · internal anchor
dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding cs.CL · 2025-05-28 · conditional · none · ref 38 · internal anchor
Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
dMoE: dLLMs with Learnable Block Experts cs.CL · 2026-05-29 · unverdicted · none · ref 59 · internal anchor
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling cs.CV · 2026-05-28 · unverdicted · none · ref 59 · internal anchor
AnyMo is a masked-modeling framework for any-modality human motion generation trained on the new OmniHuMo dataset of 5,000+ hours of multimodal motion sequences.
Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models cs.AI · 2026-05-16 · unverdicted · none · ref 44 · internal anchor
Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models cs.CV · 2026-05-14 · unverdicted · none · ref 23 · 2 links · internal anchor
Diagnoses mask prior drift and positional attention collapse in LDVLMs and introduces two plug-and-play decoding interventions that raise long-form generation quality without retraining.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 60 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 105 · internal anchor
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models cs.LG · 2026-04-20 · unverdicted · none · ref 173 · internal anchor
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
Stability-Weighted Decoding for Diffusion Language Models cs.CL · 2026-04-18 · unverdicted · none · ref 16 · internal anchor
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models cs.LG · 2026-04-15 · unverdicted · none · ref 24 · internal anchor
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation cs.SD · 2026-04-13 · unverdicted · none · ref 56 · internal anchor
LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models cs.AI · 2026-04-07 · unverdicted · none · ref 42 · internal anchor
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed cs.CL · 2025-12-16 · unverdicted · none · ref 37 · internal anchor
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion cs.LG · 2026-04-10 · unverdicted · none · ref 57 · 2 links · internal anchor
ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving cs.CL · 2026-05-22 · unreviewed · ref 23 · internal anchor
ELF: Embedded Language Flows cs.CL · 2026-05-11 · unreviewed · ref 78 · internal anchor
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit cs.CL · 2025-10-07 · unreviewed · ref 17 · internal anchor

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer