{"total":19,"items":[{"citing_arxiv_id":"2606.00724","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering","primary_cat":"cs.CL","submitted_at":"2026-05-30T13:32:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WaveFilter applies wavelet decomposition to filter critical tokens for sparse KV caching, improving long-context performance of diffusion LLMs as a plug-and-play addition to existing methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20813","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:06:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20179","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain 
losslessly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19470","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Drifting Objectives for Refining Discrete Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T07:22:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18165","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-18T10:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12825","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:47:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless 
results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09999","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Muninn: Your Trajectory Diffusion Model But Faster","primary_cat":"cs.RO","submitted_at":"2026-05-11T05:21:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"chao Wang. Learning-to-cache: Accelerating diffusion transformer via layer caching. In Neural Information Processing Systems, volume 37, pages 133282-133304, 2024. [39] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762-15772, 2024. [40] Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. DKV-Cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025. [41] David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789-814, 2000. 
[42] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik"},{"citing_arxiv_id":"2605.09536","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM","primary_cat":"cs.CL","submitted_at":"2026-05-10T13:38:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zhang, and Jun Xu. Dllm-searcher: Adapting diffusion large language model for search agents. arXiv preprint arXiv:2602.07035, 2026. [44] Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. dllm-cache: Accelerating diffusion large language models with adaptive caching. arXiv preprint arXiv:2506.06295, 2025. [45] Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025. [46] Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, and Xu Yang. d2 cache: Accelerating diffusion-based llms via dual adaptive caching. arXiv preprint arXiv:2509.23094, 2025. 
[47] Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, and"},{"citing_arxiv_id":"2605.08134","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DARE: Diffusion Language Model Activation Reuse for Efficient Inference","primary_cat":"cs.LG","submitted_at":"2026-05-01T19:15:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00161","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Consistent Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-30T19:31:02+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18995","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction","primary_cat":"cs.CL","submitted_at":"2026-04-21T02:26:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17068","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stability-Weighted Decoding for Diffusion Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-04-18T17:04:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"0 during the update $x_{t+1} \\to x_t$ equals the mutual information between the token and the updated context $x_t$, conditioned on the previous context $x_{t+1}$: $\\mathbb{E}_{x_t \\sim p(\\cdot \\mid x_{t+1})}\\left[ D_{\\mathrm{KL}}\\left( p(x_0^i \\mid x_t) \\,\\|\\, p(x_0^i \\mid x_{t+1}) \\right) \\right] = I(x_0^i; x_t \\mid x_{t+1})$. (10) Proof. Expanding the definition of expected KL divergence: $\\mathbb{E}[D_{\\mathrm{temp}}^{(i)}] = \\sum_{x_t} p(x_t \\mid x_{t+1}) \\sum_{v \\in V} p(v \\mid x_t) \\log \\frac{p(v \\mid x_t)}{p(v \\mid x_{t+1})}$ (11) $= \\sum_{x_t} \\sum_{v} p(v, x_t \\mid x_{t+1}) \\log \\frac{p(v \\mid x_t)}{p(v \\mid x_{t+1})}$ (12) Using Bayes' theorem, we substitute $p(v \\mid x_t) = \\frac{p(v, x_t \\mid x_{t+1})}{p(x_t \\mid x_{t+1})}$ into the logarithm term: $\\log \\frac{p(v \\mid x_t)}{p(v \\mid x_{t+1})} = \\log \\frac{p(v, x_t \\mid x_{t+1})}{p(x_t \\mid x_{t+1})\\, p(v \\mid x_{t+1})}$ (13) Substituting this back into the summation: $\\mathbb{E}[D_{\\mathrm{temp}}^{(i)}] = \\sum_{x_t} \\sum_{v} p(v, x_t \\mid x_{t+1}) \\log \\frac{p(v, x_t \\mid x_{t+1})}{p(v \\mid x_{t+1})\\, p(x_t \\mid x_{t+1})}$ (14)"},{"citing_arxiv_id":"2604.08302","ref_index":53,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DMax: Aggressive Parallel Decoding for dLLMs","primary_cat":"cs.LG","submitted_at":"2026-04-09T14:35:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generation [87, 24, 21], long-context modeling [47, 28, 106], and agent [104, 102]. 
Accelerating Diffusion Language Models. dLLMs are viewed as promising due to their potential for low-cost inference, yet their efficiency remains largely underexplored. Existing efforts improve efficiency from several perspectives. Some methods reduce the cost of each decoding step through techniques including KV caching [53, 49, 84, 35, 45], token dropping [15, 36, 72, 85], and sparse attention [79, 19]. Others design more effective decoding strategies [37, 81, 41, 27, 8, 32, 34, 89, 50, 12, 77, 61, 22] to improve generation efficiency. A separate line of work [73, 62, 101, 7, 14, 33] learns better decoding trajectories so that fewer decoding steps are required. dParallel [16] employs certainty-"},{"citing_arxiv_id":"2604.05250","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-04-06T23:23:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.07475","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs","primary_cat":"cs.CL","submitted_at":"2026-03-08T05:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14067","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed","primary_cat":"cs.CL","submitted_at":"2025-12-16T04:12:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04525","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion","primary_cat":"cs.LG","submitted_at":"2025-10-06T06:30:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Theoretical analysis reveals MaskGIT's implicit temperature sampling in masked diffusion; proposes equivalent moment sampler and efficiency techniques for adaptive unmasking with image and text experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.19982","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion Language Models Know the Answer Before Decoding","primary_cat":"cs.CL","submitted_at":"2025-08-27T15:40:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23606","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model","primary_cat":"cs.LG","submitted_at":"2025-05-29T16:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}