Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han · 2025 · cs.LG · arXiv 2503.09573

Mixed citation behavior. Most common role is background (60%).

37 Pith papers citing it

Background 60% of classified citations

abstract

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms
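The interpolation described in the abstract can be pictured as generating a sequence block by block (autoregressive across blocks) while filling each block by iterative unmasking (diffusion within a block). The following is a minimal toy sketch of that control flow only, not the authors' implementation: the `MASK` sentinel, the random `toy_denoiser`, and the even unmasking schedule are all assumptions made for illustration, and no trained model or KV cache is involved.

```python
import random

MASK = -1  # sentinel id for a masked (noised) position; an assumption of this sketch


def toy_denoiser(context, block, vocab_size, rng):
    """Hypothetical stand-in for a trained network.

    Returns one proposed token id per position in `block`. A real model
    would condition on `context` (the already generated blocks, whose
    keys/values could be cached) and on the partially unmasked `block`;
    this toy version ignores both and proposes random ids.
    """
    return [rng.randrange(vocab_size) for _ in block]


def sample_block(context, block_size, num_steps, vocab_size, rng):
    """Denoise one block from all-MASK to fully unmasked tokens."""
    block = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(context, block, vocab_size, rng)
        # Unmask an even share of the remaining positions at each step,
        # so that every position is filled by the final step.
        steps_left = num_steps - step
        k = -(-len(masked) // steps_left)  # ceiling division
        for i in rng.sample(masked, k):
            block[i] = proposals[i]
    return block


def generate(num_blocks, block_size, num_steps, vocab_size, seed=0):
    """Generate blocks left to right; each new block sees all earlier ones."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(num_blocks):
        sequence.extend(
            sample_block(sequence, block_size, num_steps, vocab_size, rng)
        )
    return sequence


tokens = generate(num_blocks=3, block_size=4, num_steps=2, vocab_size=10)
print(len(tokens))
```

In this sketch, `block_size=1` reduces to token-by-token generation, a single block spanning the whole sequence reduces to plain masked diffusion, and `num_blocks` can be chosen freely, which is the sense in which output length is not fixed in advance.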

citation-role summary

background 6 · method 3 · baseline 1
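The 60% background share quoted on this page appears to be taken over the 10 classified citations rather than over all 37 citing papers. A small sketch of that arithmetic, assuming the three counts in the role summary make up the full classified set:

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 6, "method": 3, "baseline": 1}

# Assumption: these three roles cover every classified citation.
classified = sum(role_counts.values())

# Share of each role, as a percentage of classified citations.
shares = {role: 100 * n / classified for role, n in role_counts.items()}

for role, pct in shares.items():
    print(f"{role}: {pct:.0f}% of {classified} classified citations")
```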

citation-polarity summary

background 6 · use method 3 · baseline 1

representative citing papers

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

NPU Design for Diffusion Language Model Inference

cs.AR · 2026-01-28 · unverdicted · novelty 8.0

Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Continuous Language Diffusion as a Decoder-Interface Problem

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

Support Before Frequency in Discrete Diffusion

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

Discrete Stochastic Localization for Non-autoregressive Generation

cs.LG · 2026-02-18 · unverdicted · novelty 7.0

Discrete Stochastic Localization lets a single trained network support an entire family of per-token SNR paths for discrete sequence generation, with masked diffusion as a special case, and improves MAUVE scores when fine-tuning pretrained checkpoints.

Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

cs.LG · 2026-02-04 · unverdicted · novelty 7.0

Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.

PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

cs.CV · 2025-11-24 · unverdicted · novelty 7.0

PartDiffuser is a semi-autoregressive discrete diffusion framework that generates high-fidelity 3D meshes from point clouds by combining inter-part autoregression with intra-part parallel diffusion using a part-aware DiT architecture.

AMix-2: Establishing Protein as a Native Modality in Large Language Models

q-bio.BM · 2026-05-29 · unverdicted · novelty 6.0

AMix-2 unifies protein sequences and text in one LLM via shared tokens and block-wise diffusion modeling, introduces the ProteinArena benchmark, and reports competitive performance against task-specific protein models and frontier LLMs.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

cs.CV · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.

TextLDM: Language Modeling with Continuous Latent Diffusion

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

cs.RO · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

Stability-Weighted Decoding for Diffusion Language Models

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.

Differences in Text Generated by Diffusion and Autoregressive Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

cs.LG · 2026-04-03 · conditional · novelty 6.0

Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.

citing papers explorer

Showing 37 of 37 citing papers.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages cs.LG · 2026-03-13 · unverdicted · none · ref 1 · internal anchor
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
NPU Design for Diffusion Language Model Inference cs.AR · 2026-01-28 · unverdicted · none · ref 13 · internal anchor
Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 32 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Continuous Language Diffusion as a Decoder-Interface Problem cs.CL · 2026-06-07 · unverdicted · none · ref 2 · internal anchor
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
Dynamic Chunking for Diffusion Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
Support Before Frequency in Discrete Diffusion cs.LG · 2026-05-13 · unverdicted · none · ref 29 · internal anchor
Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
DARE: Diffusion Language Model Activation Reuse for Efficient Inference cs.LG · 2026-05-01 · unverdicted · none · ref 11 · internal anchor
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization cs.LG · 2026-04-20 · unverdicted · none · ref 19 · internal anchor
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction cs.AI · 2026-04-20 · unverdicted · none · ref 22 · internal anchor
DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling cs.CL · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 3 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Discrete Stochastic Localization for Non-autoregressive Generation cs.LG · 2026-02-18 · unverdicted · none · ref 1 · internal anchor
Discrete Stochastic Localization lets a single trained network support an entire family of per-token SNR paths for discrete sequence generation, with masked diffusion as a special case, and improves MAUVE scores when fine-tuning pretrained checkpoints.
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models cs.LG · 2026-02-04 · unverdicted · none · ref 1 · internal anchor
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion cs.CV · 2025-11-24 · unverdicted · none · ref 1 · internal anchor
PartDiffuser is a semi-autoregressive discrete diffusion framework that generates high-fidelity 3D meshes from point clouds by combining inter-part autoregression with intra-part parallel diffusion using a part-aware DiT architecture.
AMix-2: Establishing Protein as a Native Modality in Large Language Models q-bio.BM · 2026-05-29 · unverdicted · none · ref 28 · internal anchor
AMix-2 unifies protein sequences and text in one LLM via shared tokens and block-wise diffusion modeling, introduces the ProteinArena benchmark, and reports competitive performance against task-specific protein models and frontier LLMs.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion cs.LG · 2026-05-12 · unverdicted · none · ref 2 · 2 links · internal anchor
Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion cs.CL · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation cs.CV · 2026-05-10 · unverdicted · none · ref 1 · 2 links · internal anchor
FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
TextLDM: Language Modeling with Continuous Latent Diffusion cs.CL · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving cs.RO · 2026-05-06 · unverdicted · none · ref 87 · 2 links · internal anchor
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
Stability-Weighted Decoding for Diffusion Language Models cs.CL · 2026-04-18 · unverdicted · none · ref 1 · internal anchor
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
Differences in Text Generated by Diffusion and Autoregressive Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 2 · internal anchor
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models cs.LG · 2026-04-03 · conditional · none · ref 2 · internal anchor
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
Training-Trajectory-Aware Token Selection cs.CL · 2026-01-15 · unverdicted · none · ref 1 · internal anchor
Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed cs.CL · 2025-12-16 · unverdicted · none · ref 5 · internal anchor
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 1 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces cs.LG · 2025-09-26 · unverdicted · none · ref 2 · internal anchor
A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.
Diffusion Language Models Know the Answer Before Decoding cs.CL · 2025-08-27 · conditional · none · ref 2 · internal anchor
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
Dream 7B: Diffusion Large Language Models cs.CL · 2025-08-21 · unverdicted · none · ref 2 · internal anchor
Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference cs.CL · 2025-08-04 · unverdicted · none · ref 26 · internal anchor
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance cs.RO · 2026-03-26 · unverdicted · none · ref 1 · internal anchor
Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervised finetuning plus a simple regularization term.
The Serial Scaling Hypothesis cs.LG · 2025-07-16 · unverdicted · none · ref 4 · internal anchor
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 23 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving cs.CL · 2026-05-22 · unreviewed · ref 1 · internal anchor
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models cs.LG · 2026-05-21 · unreviewed · ref 9 · internal anchor
Attention-Based Sampler for Diffusion Language Models cs.CL · 2026-03-18 · unreviewed · ref 2 · internal anchor
