hub

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

URL https://openreview · 2025 · cs.LG · arXiv 2506.06295

26 Pith papers cite this work. Polarity classification is still indexing.

26 Pith papers citing it

open full Pith review browse 26 citing papers arXiv PDF

abstract

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Drifting Objectives for Refining Discrete Diffusion Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

cs.CL · 2026-03-08 · unverdicted · novelty 7.0

Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.

Locally Coherent Parallel Decoding in Diffusion Language Models

cs.CL · 2026-03-03 · unverdicted · novelty 7.0

CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.

DiLaServe: High SLO Attainment Serving for Diffusion Language Models

cs.LG · 2026-06-27 · unverdicted · novelty 6.0

DiLaServe improves SLO attainment for diffusion language models by up to 56.6 percentage points and reduces latency by up to 46% with less than 1% accuracy drop via deadline-aware scheduling and dynamic reconfiguration.

HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

cs.LG · 2026-06-19 · unverdicted · novelty 6.0

HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

AGDO improves dLLM reasoning performance by determining denoising order and emphasizing tokens based on attention-derived dependencies rather than random masking.

SimSD: Simple Speculative Decoding in Diffusion Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation quality.

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

Introduces TSPD with a trajectory-feature controller and training-free CE to reduce denoising steps in dLLMs while aiming to preserve quality.

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

Diffusion LLMs can act as their own efficiency teachers by using revokable parallel decoding to identify reliable token orders and then distilling those orders into the model parameters for faster inference.

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

cs.CL · 2026-04-21 · conditional · novelty 6.0

Voltage pulses create nonvolatile, multi-state metastable CDW phases in bulk EuTe4 at room temperature via out-of-plane phase switching in its moiré superstructure.

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

cs.CL · 2025-12-16 · unverdicted · novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

Diffusion Language Models Know the Answer Before Decoding

cs.CL · 2025-08-27 · conditional · novelty 6.0

DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.

Understanding Evaluation Illusion in Diffusion Large Language Models

cs.CL · 2026-06-28 · unverdicted · novelty 5.0 · 2 refs

Prompt template choice strongly affects apparent performance of parallel decoding methods in dLLMs, causing inconsistent rankings and illusory gains; current methods fail to beat single-token decoding.

Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

cs.CL · 2026-06-15 · unverdicted · novelty 5.0

ASRD is a training-free revocable decoding framework for diffusion LLMs that decouples context into trusted anchor tokens and uncertain candidates to improve accuracy by up to 6.4% and speed by up to 7.2x on math and coding benchmarks.

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

cs.CL · 2026-05-30 · unverdicted · novelty 5.0

WaveFilter applies wavelet decomposition to filter critical tokens for sparse KV caching, improving long-context performance of diffusion LLMs as a plug-and-play addition to existing methods.

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

cs.CL · 2026-05-27 · unverdicted · novelty 5.0

Dynamic-dLLM achieves over 3x average inference speedup on dLLMs like LLaDA-8B via adaptive cache budgets and decoding thresholds while preserving benchmark performance.

citing papers explorer

Showing 1 of 1 citing paper after filters.

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning cs.RO · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer