Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Pith reviewed 2026-05-16 04:22 UTC · model grok-4.3
The pith
Diffusion LLMs can reach up to 27.6 times higher throughput by adding a reusable block-wise KV cache and decoding only high-confidence tokens in parallel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A block-wise approximate KV cache mechanism tailored for bidirectional diffusion models enables cache reuse with negligible performance drop, while a confidence-aware parallel decoding strategy selectively decodes only tokens above a fixed threshold, thereby mitigating dependency violations and preserving generation quality.
What carries the argument
Block-wise approximate KV cache combined with a confidence threshold that controls which tokens are decoded in parallel
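As a concrete illustration, here is a minimal Python sketch of the confidence-aware rule: at each denoising step, only masked positions whose top-1 probability clears a fixed threshold are committed in parallel. This is not the authors' implementation; the mask id, the 0.9 default, and the fall-back of committing the single most confident token when nothing clears the bar are assumptions made for the sketch.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def parallel_decode_step(logits: torch.Tensor,
                         tokens: torch.Tensor,
                         threshold: float = 0.9) -> torch.Tensor:
    """One denoising step: commit only masked positions whose top-1
    probability exceeds `threshold`; the rest stay masked for later steps."""
    probs = logits.softmax(dim=-1)          # (seq_len, vocab)
    conf, pred = probs.max(dim=-1)          # per-position confidence and argmax
    masked = tokens == MASK_ID
    accept = masked & (conf >= threshold)   # decode in parallel only where confident
    out = tokens.clone()
    out[accept] = pred[accept]
    # Assumed stall guard: if nothing clears the bar, commit the single best token.
    if masked.any() and not accept.any():
        pos = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
        out[pos] = pred[pos]
    return out

# toy usage: 8 masked positions over a vocabulary of 50
tokens = parallel_decode_step(torch.randn(8, 50), torch.full((8,), MASK_ID))
```

The threshold is the single control: lowering it commits more tokens per step (more parallelism, more risk of dependency violations), while raising it degrades decoding toward one token per step.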
If this is right
- Throughput rises by as much as 27.6 times on existing Diffusion LLM checkpoints.
- Accuracy remains close to the base model on standard language benchmarks.
- The speed gap between diffusion and autoregressive models is largely closed.
- No retraining is required, so existing open-source Diffusion LLMs can be accelerated immediately.
Where Pith is reading between the lines
- The same block-wise cache pattern could be tested on other bidirectional sequence models outside language.
- Replacing the fixed threshold with a length-dependent or entropy-based rule might reduce the few remaining quality drops.
- Hardware kernels that exploit the block structure could push the speedup beyond the reported software numbers.
Load-bearing premise
The block-wise KV cache approximation introduces only negligible error, and a single fixed threshold works across benchmarks without per-task retuning.
What would settle it
A direct comparison on a held-out long-sequence benchmark showing either that the accelerated model loses more than a few percent accuracy or that cache reuse causes measurable cumulative drift relative to full recomputation.
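A minimal harness for that comparison might look as follows; generate_exact and generate_cached are assumed stand-ins for full recomputation and block-wise cache reuse, not interfaces from the paper.

```python
def divergence_report(generate_exact, generate_cached, prompts):
    """Token-level agreement and first-divergence position between full
    recomputation and cache-reuse generation; cumulative drift would show
    up as early, widening splits on long sequences."""
    for p in prompts:
        a, b = generate_exact(p), generate_cached(p)
        n = min(len(a), len(b))
        first = next((i for i in range(n) if a[i] != b[i]), None)
        agreement = sum(a[i] == b[i] for i in range(n)) / max(n, 1)
        print(f"prompt={p!r}  agreement={agreement:.3f}  first_divergence={first}")

# toy usage with stand-in generators
divergence_report(lambda p: list(p) + [1, 2, 3],
                  lambda p: list(p) + [1, 2, 4],
                  prompts=[(7, 8), (9,)])
```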
read the original abstract
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Fast-dLLM, a training-free acceleration technique for diffusion-based LLMs. It proposes a block-wise approximate KV cache tailored to bidirectional diffusion attention to enable cache reuse, and a confidence-aware parallel decoding strategy that selectively decodes high-confidence tokens to avoid dependency violations under the conditional independence assumption. Experiments on LLaDA and Dream models across standard LLM benchmarks report up to 27.6× throughput improvement with minimal accuracy loss, narrowing the gap to autoregressive models.
Significance. If the central claims hold, the work would meaningfully advance practical deployment of diffusion LLMs by delivering substantial inference speedups without retraining, leveraging their inherent parallel decoding capability. The training-free design and reported empirical gains on multiple models and benchmarks constitute a concrete engineering contribution, though the absence of supporting analysis for the key approximations limits the strength of the significance assessment.
major comments (3)
- [Abstract and §3] The claim that the block-wise approximation enables cache reuse 'with negligible performance drop' is load-bearing for the throughput results, yet the manuscript provides no error-bound analysis, no dependency-handling rule for future tokens under bidirectional attention, and no quantitative characterization of the approximation error; one candidate error metric is sketched after this list.
- [§4] The strategy relies on a single fixed confidence threshold, but the exact selection rule is unspecified and no sensitivity analysis or cross-benchmark validation without per-task retuning is presented, leaving the 'minimal accuracy loss' claim vulnerable to benchmark-specific tuning.
- [Experiments] The reported 27.6× throughput figures rest on the two unverified conditions above; without ablations on block size, threshold sensitivity, or error metrics for the KV approximation, it is unclear whether the gains generalize or are tied to particular benchmark choices.
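On the first comment, one candidate error metric is straightforward to state: the mean absolute deviation between attention weights computed with fresh keys and with reused (stale) keys. The sketch below is illustrative only; the shapes and the perturbation are assumptions, not the authors' protocol.

```python
import torch

def attention_score_deviation(q, k_fresh, k_stale):
    """Mean absolute change in attention weights when stale (cached) keys
    replace fresh ones. q: (L, d) queries; k_*: (L, d) keys."""
    d = q.shape[-1]
    a_fresh = ((q @ k_fresh.T) / d ** 0.5).softmax(dim=-1)
    a_stale = ((q @ k_stale.T) / d ** 0.5).softmax(dim=-1)
    return (a_fresh - a_stale).abs().mean().item()

# toy check: a small perturbation of the keys should yield a small deviation
q, k = torch.randn(16, 64), torch.randn(16, 64)
print(attention_score_deviation(q, k, k + 0.01 * torch.randn_like(k)))
```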
minor comments (2)
- [Notation and §4] Clarify notation for block size, confidence threshold, and the precise condition under which a token is decoded in parallel.
- [Figures] Add error bars or multiple-run statistics to throughput and accuracy plots to support the 'minimal loss' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript's claims regarding the KV cache approximation and parallel decoding strategy.
read point-by-point responses
- Referee: [Abstract and §3] The claim that the block-wise approximation enables cache reuse 'with negligible performance drop' is load-bearing for the throughput results, yet the manuscript provides no error-bound analysis, no dependency-handling rule for future tokens under bidirectional attention, and no quantitative characterization of the approximation error.
Authors: We agree that formal error-bound analysis and explicit dependency rules would strengthen the presentation. The block-wise approximation reuses cached keys and values for tokens within the same diffusion block while approximating cross-block interactions under the bidirectional attention pattern; future tokens are handled by a mask that prevents premature dependency violations during the denoising steps. Although we lack a closed-form error bound, the empirical results on LLaDA and Dream show accuracy drops below 1% on average across benchmarks. In revision we will add a dedicated subsection with quantitative error metrics (e.g., average attention-score deviation) and a clear statement of the dependency-handling rule. revision: partial
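To make the described control flow concrete, the sketch below renders one possible reading of the mechanism: the cache for everything outside the active block is refreshed once per block and then reused, unchanged, across all denoising steps inside it. TinyModel is a stub with an assumed interface, not the paper's model, and the sketch reuses parallel_decode_step from the decoding sketch earlier on this page.

```python
import torch

MASK_ID = 0

class TinyModel:
    """Stub with the assumed interface: a full pass refreshes the cache,
    a block pass reuses it and scores only the active block."""
    def __init__(self, vocab=50):
        self.vocab = vocab

    def full_pass(self, seq):
        # Placeholder: a real model would return K/V tensors per layer here.
        return seq.clone()

    def block_pass(self, seq_block, kv_cache):
        logits = torch.randn(seq_block.numel(), self.vocab)
        logits[:, MASK_ID] = float("-inf")  # never predict the mask token itself
        return logits

def generate_blockwise(model, prompt, n_blocks=4, block_len=8, threshold=0.9):
    seq = torch.cat([prompt, torch.full((n_blocks * block_len,), MASK_ID)])
    start = prompt.numel()
    for b in range(n_blocks):
        blk = slice(start + b * block_len, start + (b + 1) * block_len)
        # The approximation: refresh the cache once per block, then reuse it.
        kv_cache = model.full_pass(seq)
        for _ in range(block_len):  # at most block_len cheap block-only passes
            logits = model.block_pass(seq[blk], kv_cache)
            seq[blk] = parallel_decode_step(logits, seq[blk], threshold)
            if (seq[blk] != MASK_ID).all():
                break  # block fully decoded; cache refreshes at the boundary
    return seq

print(generate_blockwise(TinyModel(), prompt=torch.randint(1, 50, (4,))))
```

The approximation lives entirely in the single full_pass per block: within a block, the surrounding keys and values are stale by up to block_len decoding steps, which is exactly the error the promised deviation metric would quantify.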
- Referee: [§4] The strategy relies on a single fixed confidence threshold, but the exact selection rule is unspecified and no sensitivity analysis or cross-benchmark validation without per-task retuning is presented, leaving the 'minimal accuracy loss' claim vulnerable to benchmark-specific tuning.
Authors: The threshold is fixed at a single value chosen on a validation split and applied uniformly; we will state this selection rule explicitly in the revised §4. We will also add a sensitivity study across thresholds on all reported benchmarks, confirming that accuracy remains stable without per-task retuning and thereby supporting the claim of minimal accuracy loss. revision: yes
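The promised study reduces to a one-dimensional sweep. In the sketch below, evaluate is an assumed hook (not part of the paper) returning accuracy and mean tokens decoded per forward pass at a given threshold; a flat accuracy curve across the grid would support the no-retuning claim.

```python
def threshold_sweep(evaluate, grid=(0.70, 0.80, 0.90, 0.95, 0.99)):
    """Sweep one fixed threshold; `evaluate(t)` is an assumed hook returning
    (accuracy, mean tokens decoded per forward pass) on a benchmark."""
    for t in grid:
        acc, tok_per_step = evaluate(t)
        print(f"threshold={t:.2f}  accuracy={acc:.3f}  tokens/step={tok_per_step:.2f}")

# toy stand-in: higher thresholds trade parallelism for safety
threshold_sweep(lambda t: (0.80 + 0.05 * t, 1.0 + 10.0 * (1.0 - t)))
```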
- Referee: [Experiments] The reported 27.6× throughput figures rest on the two unverified conditions above; without ablations on block size, threshold sensitivity, or error metrics for the KV approximation, it is unclear whether the gains generalize or are tied to particular benchmark choices.
Authors: We acknowledge that additional ablations would better demonstrate generalization. The current results already span two distinct diffusion LLMs and multiple standard benchmarks, but we will expand the experimental section with block-size ablations, threshold-sensitivity curves, and explicit KV-approximation error metrics to clarify that the reported speedups are not benchmark-specific. revision: yes
Circularity Check
No significant circularity in the paper's engineering methods
full rationale
The paper presents a training-free acceleration approach via a block-wise approximate KV cache and confidence-aware parallel decoding for diffusion LLMs. These are practical mechanisms whose effectiveness is demonstrated empirically on LLaDA and Dream models across benchmarks, with reported throughput gains and minimal accuracy loss. No derivation chain reduces a claimed prediction or result to a fitted parameter or a self-defined quantity. The approximations and the threshold choice are validated through experiments rather than justified by self-citation or ansatz smuggling. The central claims rest on external benchmark comparisons, so the work is grounded against those benchmarks rather than on any internal circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence threshold
axioms (1)
- domain assumption: Bidirectional attention in diffusion models permits block-wise KV cache reuse with only small error.
Forward citations
Cited by 22 Pith papers
- NPU Design for Diffusion Language Model Inference
  Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.
- Support Before Frequency in Discrete Diffusion
  Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
- TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
  TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
- LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
  LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
- Fast Byte Latent Transformer
  BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
  FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
- DARE: Diffusion Language Model Activation Reuse for Efficient Inference
  DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
- Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
  Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.
- NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
  NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
- ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
  ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
- Flow Map Language Models: One-step Language Modeling via Continuous Denoising
  Continuous flow language models match discrete diffusion baselines, and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
- How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
  Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- Consistent Diffusion Language Models
  CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
  Position and step penalties plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
  DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly...