PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
Pith reviewed 2026-05-21 05:02 UTC · model grok-4.3
The pith
PulseCol achieves up to 1.95x speedup for diffusion language models using periodically refreshed column-sparse attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that column-sparse attention patterns identified at the first denoising step can be reused across most iterations, with refreshes only at a few intermediate points, yielding higher sparsity, maintained quality, and up to 1.95× end-to-end speedup over FlashAttention across context lengths.
What carries the argument
Periodically refreshed column-sparse attention that selects important columns for computation and updates the selection infrequently to follow pattern changes during denoising.
Load-bearing premise
Sparse patterns from early denoising steps stay valid enough for reuse with only occasional refreshes and do not degrade the final generated text quality.
What would settle it
Measure whether PulseCol outputs match full-attention outputs on standard quality metrics, such as perplexity or coherence scores, when both use the same number of denoising steps.
Original abstract
Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95× end-to-end speedup over FlashAttention across several context lengths.
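The reuse-and-refresh loop described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' GPU kernel: every function name is hypothetical, summed attention probability is only one plausible column-importance score, and running a dense scoring pass at each refresh step is an assumption about how the selection is obtained.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_logits(Q, K):
    """Scaled dot-product logits: one row per query, one column per key."""
    d = len(Q[0])
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K] for q in Q]

def select_columns(Q, K, keep):
    """Assumed scoring rule: keep the `keep` key columns with the most
    attention probability summed over all queries."""
    probs = [softmax(row) for row in attention_logits(Q, K)]
    mass = [sum(p[j] for p in probs) for j in range(len(K))]
    ranked = sorted(range(len(K)), key=lambda j: mass[j], reverse=True)
    return sorted(ranked[:keep])

def column_sparse_attention(Q, K, V, cols):
    """Attend only to the key/value positions listed in `cols`."""
    K_s = [K[j] for j in cols]
    V_s = [V[j] for j in cols]
    out = []
    for row in attention_logits(Q, K_s):
        w = softmax(row)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V_s))
                    for c in range(len(V_s[0]))])
    return out

def denoise_with_refresh(step_inputs, keep, refresh_steps):
    """step_inputs holds one (Q, K, V) triple per denoising step.
    Columns are selected at step 0, reused afterwards, and recomputed
    only at the steps listed in `refresh_steps`."""
    cols, outputs, refreshed = None, [], []
    for t, (Q, K, V) in enumerate(step_inputs):
        if cols is None or t in refresh_steps:
            cols = select_columns(Q, K, keep)  # dense pass, refresh steps only
            refreshed.append(t)
        outputs.append(column_sparse_attention(Q, K, V, cols))
    return outputs, refreshed
```

In this sketch the cost saving comes from `column_sparse_attention` touching only `keep` keys on non-refresh steps; a real implementation would fuse the gather and the softmax in a single kernel rather than materialize `K_s` and `V_s`.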
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PulseCol, a periodically refreshed column-sparse attention method for accelerating inference in diffusion large language models (dLLMs). It replaces prior block-sparse approximations (applied only in later denoising iterations) with finer-grained column sparsity, computes sparse masks at the initial denoising step, reuses them across subsequent steps, and refreshes only at a small number of intermediate points to track pattern evolution. Optimized GPU kernels for column-sparse attention are presented, with experiments claiming higher sparsity, up to 1.95× end-to-end speedup over FlashAttention across context lengths, and maintained model quality.
Significance. If the empirical results hold under scrutiny, this approach could provide a practical advance for efficient dLLM inference by enabling earlier and more precise sparsification through reusable column-level patterns rather than coarse blocks. The concrete design choices (column sparsity ratio and refresh interval as free parameters) and reported practical speedups via custom kernels are strengths. However, the central claim of quality preservation rests on the stability of early-identified patterns, which requires stronger empirical grounding to fully assess impact.
major comments (2)
- §3 (Method): The description of computing column-sparse masks at the initial denoising step and reusing them until periodic refreshes does not include a quantitative measure or bound on attention pattern drift across the denoising trajectory. This is load-bearing for the speedup and quality claims, as any material shift in important columns between refreshes would accumulate approximation error.
- §4 (Experiments): No ablation is reported on varying the refresh interval (one of the two free parameters) or on the number of refresh points, nor are quality metrics (e.g., perplexity, coherence scores) shown as a function of refresh frequency. Without this, it is difficult to verify that infrequent refreshes suffice to maintain output quality as asserted in the abstract.
minor comments (2)
- The abstract states results 'across several context lengths' but the main text would benefit from an explicit table or figure listing the exact lengths tested and per-length speedup/quality numbers for reproducibility.
- Notation for the column sparsity ratio and refresh schedule could be introduced earlier with a clear mathematical definition to aid readers following the algorithmic description.
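One way the requested notation could be set out is shown below. It is a sketch only: the symbols N, T, d, ρ, k, the refresh set R, the selected-column set S_t, the score function s, and the TopK operator are all assumptions introduced here for illustration and are not taken from the paper, which may select columns per head or per query block.

```latex
% Hypothetical notation, not the paper's own.
% N: sequence length, T: number of denoising steps, d: head dimension.
% \rho \in [0,1): column sparsity ratio.
% \mathcal{R} \subseteq \{2,\dots,T\}: steps at which the selection is refreshed.
% s(Q_t, K_t) \in \mathbb{R}^N: an unspecified per-column importance score.
% K_t[\mathcal{S}], V_t[\mathcal{S}]: rows of K_t, V_t indexed by \mathcal{S}.
\begin{align*}
k &= \operatorname{round}\bigl((1-\rho)\,N\bigr), \\
\mathcal{S}_t &=
  \begin{cases}
    \operatorname{TopK}_k\bigl(s(Q_t, K_t)\bigr), & t = 1 \text{ or } t \in \mathcal{R}, \\
    \mathcal{S}_{t-1}, & \text{otherwise},
  \end{cases} \\
O_t &= \operatorname{softmax}\!\left(
         \frac{Q_t \, K_t[\mathcal{S}_t]^{\top}}{\sqrt{d}}
       \right) V_t[\mathcal{S}_t].
\end{align*}
```

Under this formalization the two free parameters appear explicitly: ρ fixes how many columns survive, and R (or a fixed spacing between its elements) fixes how often the selection is recomputed.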
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that will provide additional empirical support for the stability of the proposed sparsity patterns.
Point-by-point responses
- Referee: §3 (Method): The description of computing column-sparse masks at the initial denoising step and reusing them until periodic refreshes does not include a quantitative measure or bound on attention pattern drift across the denoising trajectory. This is load-bearing for the speedup and quality claims, as any material shift in important columns between refreshes would accumulate approximation error.
  Authors: We agree that an explicit quantitative analysis of attention pattern drift would strengthen the justification for periodic reuse. The current manuscript relies on end-to-end quality preservation in experiments to imply that drift remains limited within the chosen refresh intervals, but does not report direct metrics such as column overlap or drift bounds. In the revised version we will add such measurements across the denoising trajectory to quantify stability and bound the approximation error. Revision: yes.
- Referee: §4 (Experiments): No ablation is reported on varying the refresh interval (one of the two free parameters) or on the number of refresh points, nor are quality metrics (e.g., perplexity, coherence scores) shown as a function of refresh frequency. Without this, it is difficult to verify that infrequent refreshes suffice to maintain output quality as asserted in the abstract.
  Authors: We acknowledge that a sensitivity analysis on the refresh interval would better demonstrate robustness. The submitted manuscript reports results only for the specific refresh schedule used to obtain the claimed speedups and quality, without varying the interval or number of refresh points. In the revision we will include an ablation varying the refresh interval, reporting perplexity and other quality metrics as a function of refresh frequency to verify that infrequent refreshes suffice. Revision: yes.
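The column-overlap measurement discussed in the responses above could take roughly the following form. It is a sketch only: Jaccard overlap is one plausible drift metric among several, the function names are hypothetical, and the per-step column lists are assumed to come from a dense reference run that is not part of the manuscript as described.

```python
def column_overlap(cols_a, cols_b):
    """Jaccard overlap between two selected-column sets (1.0 means identical)."""
    a, b = set(cols_a), set(cols_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def overlap_with_columns_in_use(cols_per_step, refresh_steps):
    """For each denoising step, compare the columns that step would select
    on its own with the columns actually in use, i.e. those chosen at the
    most recent refresh (step 0 always counts as a refresh)."""
    in_use, overlaps = None, []
    for t, cols in enumerate(cols_per_step):
        if in_use is None or t in refresh_steps:
            in_use = cols
        overlaps.append(column_overlap(in_use, cols))
    return overlaps
```

Running `overlap_with_columns_in_use` for several candidate refresh sets on the same trajectory would give the drift-versus-refresh-frequency curve that the second major comment asks for, with low overlap values flagging steps where a refresh arrived too late.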
Circularity Check
No significant circularity; algorithmic proposal is self-contained
The paper defines PulseCol via explicit algorithmic choices: column-sparse attention replacing block sparsity, with sparse masks computed at the initial denoising step and reused until a small number of refresh points. These are concrete design decisions evaluated empirically against external baselines such as FlashAttention. No equations or claims reduce by construction to fitted inputs, self-citations, or prior ansatzes from the same authors. The central performance claims rest on experimental measurements rather than tautological re-derivations, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- refresh interval
- column sparsity ratio
axioms (1)
- domain assumption: Attention patterns in early denoising steps are sufficiently stable to be reused for multiple subsequent steps.
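To make the two free parameters concrete, they can be written as a small configuration object. This is a hypothetical sketch: the class name, field names, the rounding rule, and the evenly spaced refresh schedule are assumptions for illustration, since the ledger lists only the parameter names and not how the paper sets them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PulseColConfig:  # hypothetical name, not from the paper
    column_sparsity_ratio: float  # fraction of key columns dropped, in [0, 1)
    refresh_interval: int         # denoising steps between refreshes, >= 1

    def __post_init__(self):
        if not 0.0 <= self.column_sparsity_ratio < 1.0:
            raise ValueError("column_sparsity_ratio must be in [0, 1)")
        if self.refresh_interval < 1:
            raise ValueError("refresh_interval must be >= 1")

    def kept_columns(self, seq_len: int) -> int:
        """Number of key columns retained; always at least one."""
        return max(1, round((1.0 - self.column_sparsity_ratio) * seq_len))

    def refresh_steps(self, num_steps: int) -> list[int]:
        """Assumed schedule: step 0, then every `refresh_interval` steps."""
        return list(range(0, num_steps, self.refresh_interval))
```

For example, `PulseColConfig(0.75, 4)` keeps 256 of 1024 columns and, over 10 denoising steps, recomputes the selection at steps 0, 4, and 8.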