Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
MMaDA: Multimodal Large Diffusion Language Models
Mixed citation behavior. Most common role is background (64%).
abstract
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
representative citing papers
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
TimeROME-DLM enables training-free knowledge editing in masked diffusion language models via temporal causal tracing and low-rank residual edit memory applied at inference time.
MaskForge reaches 79.3% average attack success rate on five dLLMs by adaptively searching and accumulating structural attack patterns with a UCB bandit, improving 17.6% over baselines and transferring to 88.2% on AdvBench.
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.
dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
A masked discrete diffusion model adds token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.
DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with gradient updates via an on-policy ELBO regularizer, yielding more stable training and higher rewards than prior methods.
PAPO improves reasoning performance in diffusion LLMs by converting sparse terminal rewards into dense step-wise credit and replaying real high-uncertainty trajectories, reporting gains up to 42.2% on Countdown.
SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation quality.
Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
VRCD prioritizes visually complementary positions during parallel decoding in dMLLMs by measuring attention overlap with the new Visual Redundancy Index, yielding accuracy gains over confidence-based baselines on M^3CoT and MMBench.
Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.
Diffusion LLMs can act as their own efficiency teachers by using revokable parallel decoding to identify reliable token orders and then distilling those orders into the model parameters for faster inference.
citing papers explorer
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
  Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
- Discrete Langevin-Inspired Posterior Sampling
  ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
- NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
  NoiseRater meta-learns instance-level importance scores for noise in diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
  Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
  UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
- Motus: A Unified Latent Action World Model
  Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.