arxiv: 2605.20813 · v1 · pith:MA7NVGGNnew · submitted 2026-05-20 · 💻 cs.CL

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

Pith reviewed 2026-05-21 05:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords diffusion language models · sparse attention · column-sparse · inference speedup · periodic refresh · self-attention optimization · GPU kernels

The pith

PulseCol achieves up to 1.95× speedup for diffusion language models using periodically refreshed column-sparse attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that replacing full self-attention with a column-sparse version, refreshed only at selected denoising steps, lets diffusion language models run inference faster. It identifies the important columns at an early step and reuses that pattern with occasional updates, which exposes more sparsity than block-sparse methods that are applied only in later iterations. If this holds, the repeated attention computation during denoising becomes much cheaper, which would make long-context generation more feasible on current hardware. Custom GPU kernels are what make the column sparsity pay off in practice.

Core claim

The central discovery is that column-sparse attention patterns identified at the first denoising step can be reused across most iterations, with refreshes only at a few intermediate points, yielding higher sparsity, maintained quality, and up to 1.95× end-to-end speedup over FlashAttention across context lengths.

What carries the argument

Periodically refreshed column-sparse attention that selects important columns for computation and updates the selection infrequently to follow pattern changes during denoising.
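
The mechanism described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the scoring rule (total attention mass per column), the function names `select_columns` and `run_denoising`, and the parameters `keep_ratio` and `refresh_steps` are all assumptions introduced here, since this page does not say how the paper ranks columns.

```python
def select_columns(attn, keep_ratio):
    """Return sorted indices of the columns with the largest total attention mass.

    Assumption: column importance is the sum of attention weights over all
    query rows. The paper may use a different criterion.
    """
    n_cols = len(attn[0])
    k = max(1, round(keep_ratio * n_cols))
    mass = [sum(row[j] for row in attn) for j in range(n_cols)]
    top = sorted(range(n_cols), key=lambda j: mass[j], reverse=True)[:k]
    return sorted(top)


def run_denoising(attn_per_step, keep_ratio, refresh_steps):
    """Reuse one column set across steps, recomputing it only at refresh steps.

    attn_per_step[t] stands in for the dense attention map at step t. A real
    system would only need it at step 0 and at the refresh steps.
    """
    schedule = []
    cols = None
    for t, attn in enumerate(attn_per_step):
        if t == 0 or t in refresh_steps:
            cols = select_columns(attn, keep_ratio)
        schedule.append((t, list(cols)))
    return schedule


# Toy example: the dominant column moves from 0 to 3 at step 2.
early = [[0.7, 0.1, 0.1, 0.1],
         [0.6, 0.2, 0.1, 0.1]]
late = [[0.1, 0.1, 0.1, 0.7],
        [0.1, 0.2, 0.1, 0.6]]
maps = [early, early, late, late]
print(run_denoising(maps, keep_ratio=0.5, refresh_steps={2}))
# prints [(0, [0, 1]), (1, [0, 1]), (2, [1, 3]), (3, [1, 3])]
```

In the toy run, the refresh at step 2 is what lets the retained set follow the shift from column 0 to column 3. Without it, steps 2 and 3 would keep attending to columns that no longer carry most of the mass.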

Load-bearing premise

Sparse patterns from early denoising steps stay valid enough for reuse with only occasional refreshes and do not degrade the final generated text quality.

What would settle it

Check whether PulseCol outputs match full-attention outputs on standard quality measures, such as perplexity or coherence scores, when both use the same number of denoising steps.

Figures

Figures reproduced from arXiv: 2605.20813 by Futing Sun, Letian Chen, Liqiang Nie, Miao Zhang, Weili Guan, Yanyi Lyu.

Figure 1: PulseCol improves the sparsity-efficiency trade-off of dLLM inference. (a) On GSM8K, …
Figure 2: Early denoising attention in LLaDA-1.5 exhibits column sparsity. We visualize representa…
Figure 3: Overview of PulseCol. By constructing, refreshing, and reusing column-sparse indices, …
Figure 4: Workflow of the column-sparse attention kernel. Each query block attends only to indexed key-value tiles, updates online softmax statistics in SRAM, and writes the normalized output back to HBM without materializing the full attention matrix. The kernel schedules computation at the granularity of query blocks. For each query block i, it loads Q_i from HBM into SRAM, initializes the row-wise online softmax…
Figure 5: Speedup of our column-sparse kernel over FlashAttention under different sparsity levels and context lengths. Latency. We evaluate the standalone efficiency of the column-sparse attention kernel. Unless otherwise stated, all latency measurements are conducted on an NVIDIA RTX PRO 6000 GPU …
Figure 6: Hyperparameter analysis on HumanEval. We study the effects of query group size, sparse …
Figure 7: Additional attention visualizations in LLaDA-1.5. Each layer shows one representative …
Figure 8: Kernel speedup over FlashAttention with different query group sizes. The two panels …
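
The kernel workflow in the Figure 4 caption (attend only to indexed key-value tiles, keep online softmax statistics, normalize at the end) can be illustrated for a single query row. This is a plain-Python sketch under stated assumptions, not the paper's GPU kernel: the name `sparse_attention_row`, the `tile` parameter, and the 1/sqrt(d) score scaling are choices made here for illustration, and `cols` is assumed to be non-empty.

```python
import math


def sparse_attention_row(q, keys, values, cols, tile=2):
    """Attention output for one query over the indexed key/value columns only.

    The selected columns are processed tile by tile while tracking a running
    max (m), a running normalizer (l), and an unnormalized accumulator (acc),
    so the full score row is never stored.
    """
    scale_d = 1.0 / math.sqrt(len(q))
    m = float("-inf")
    l = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(cols), tile):
        idx = cols[start:start + tile]
        scores = [scale_d * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)  # 0.0 on the first tile, since m is -inf
        l *= rescale
        acc = [a * rescale for a in acc]
        for j, s in zip(idx, scores):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * v for a, v in zip(acc, values[j])]
        m = m_new
    return [a / l for a in acc]


keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
print(sparse_attention_row([0.5, -1.0], keys, values, cols=[0, 2, 4]))
```

The result does not depend on the tile size, up to floating-point rounding, because each new tile rescales the earlier partial sums to the updated running maximum before adding its own contribution.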

Original abstract

Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95× end-to-end speedup over FlashAttention across several context lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PulseCol, a periodically refreshed column-sparse attention method for accelerating inference in diffusion large language models (dLLMs). It replaces prior block-sparse approximations (applied only in later denoising iterations) with finer-grained column sparsity, computes sparse masks at the initial denoising step, reuses them across subsequent steps, and refreshes only at a small number of intermediate points to track pattern evolution. Optimized GPU kernels for column-sparse attention are presented, with experiments claiming higher sparsity, up to 1.95× end-to-end speedup over FlashAttention across context lengths, and maintained model quality.

Significance. If the empirical results hold under scrutiny, this approach could provide a practical advance for efficient dLLM inference by enabling earlier and more precise sparsification through reusable column-level patterns rather than coarse blocks. The concrete design choices (column sparsity ratio and refresh interval as free parameters) and reported practical speedups via custom kernels are strengths. However, the central claim of quality preservation rests on the stability of early-identified patterns, which requires stronger empirical grounding to fully assess impact.

major comments (2)
  1. [§3] §3 (Method): The description of computing column-sparse masks at the initial denoising step and reusing them until periodic refreshes does not include a quantitative measure or bound on attention pattern drift across the denoising trajectory. This is load-bearing for the speedup and quality claims, as any material shift in important columns between refreshes would accumulate approximation error.
  2. [§4] §4 (Experiments): No ablation is reported on varying the refresh interval (one of the two free parameters) or on the number of refresh points, nor are quality metrics (e.g., perplexity, coherence scores) shown as a function of refresh frequency. Without this, it is difficult to verify that infrequent refreshes suffice to maintain output quality as asserted in the abstract.
minor comments (2)
  1. The abstract states results 'across several context lengths' but the main text would benefit from an explicit table or figure listing the exact lengths tested and per-length speedup/quality numbers for reproducibility.
  2. Notation for the column sparsity ratio and refresh schedule could be introduced earlier with a clear mathematical definition to aid readers following the algorithmic description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that will provide additional empirical support for the stability of the proposed sparsity patterns.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The description of computing column-sparse masks at the initial denoising step and reusing them until periodic refreshes does not include a quantitative measure or bound on attention pattern drift across the denoising trajectory. This is load-bearing for the speedup and quality claims, as any material shift in important columns between refreshes would accumulate approximation error.

    Authors: We agree that an explicit quantitative analysis of attention pattern drift would strengthen the justification for periodic reuse. The current manuscript relies on end-to-end quality preservation in experiments to imply that drift remains limited within the chosen refresh intervals, but does not report direct metrics such as column overlap or drift bounds. In the revised version we will add such measurements across the denoising trajectory to quantify stability and bound the approximation error. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation is reported on varying the refresh interval (one of the two free parameters) or on the number of refresh points, nor are quality metrics (e.g., perplexity, coherence scores) shown as a function of refresh frequency. Without this, it is difficult to verify that infrequent refreshes suffice to maintain output quality as asserted in the abstract.

    Authors: We acknowledge that a sensitivity analysis on the refresh interval would better demonstrate robustness. The submitted manuscript reports results only for the specific refresh schedule used to obtain the claimed speedups and quality, without varying the interval or number of refresh points. In the revision we will include an ablation varying the refresh interval, reporting perplexity and other quality metrics as a function of refresh frequency to verify that infrequent refreshes suffice. revision: yes
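
One concrete form the "column overlap" measurement mentioned in the first response could take is a Jaccard overlap between the column set in use and the set that would be chosen fresh at each step. The sketch below is an assumption about how such a metric might be computed, not something the manuscript reports; `column_overlap` and `drift_since_refresh` are hypothetical names, and the index lists are made-up toy data.

```python
def column_overlap(cols_a, cols_b):
    """Jaccard overlap between two sets of column indices (1.0 means identical)."""
    a, b = set(cols_a), set(cols_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drift_since_refresh(cols_per_step, refresh_steps):
    """Overlap at each step between the reused columns and that step's own choice.

    cols_per_step[t] is the column set that would be selected if the pattern
    were recomputed at step t. The set in use is replaced only at step 0 and
    at the steps listed in refresh_steps.
    """
    report = []
    in_use = None
    for t, ideal in enumerate(cols_per_step):
        if t == 0 or t in refresh_steps:
            in_use = ideal
        report.append(column_overlap(in_use, ideal))
    return report


# Toy data: the ideal set drifts at step 1 and again at step 2.
ideal_sets = [[0, 1, 2], [0, 1, 3], [0, 3, 4], [0, 3, 4]]
print(drift_since_refresh(ideal_sets, refresh_steps={2}))
# prints [1.0, 0.5, 1.0, 1.0]
```

A curve like this, plotted over the denoising trajectory for several refresh schedules, would show directly how far the reused pattern falls behind between refreshes.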

Circularity Check

0 steps flagged

No significant circularity; algorithmic proposal is self-contained

Full rationale

The paper defines PulseCol via explicit algorithmic choices: column-sparse attention replacing block sparsity, with sparse masks computed at the initial denoising step and reused until a small number of refresh points. These are concrete design decisions evaluated empirically against external baselines such as FlashAttention. No equations or claims reduce by construction to fitted inputs, self-citations, or prior ansatzes from the same authors. The central performance claims rest on experimental measurements rather than tautological re-derivations, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on the empirical observation that attention patterns evolve slowly enough to allow reuse with periodic updates. No new physical entities or mathematical axioms beyond standard transformer attention are introduced.

free parameters (2)
  • refresh interval
    Number of denoising steps between pattern recomputations; chosen to balance accuracy and speed.
  • column sparsity ratio
    Fraction of columns retained per attention head; tuned for quality-speed tradeoff.
axioms (1)
  • domain assumption: Attention patterns in early denoising steps are sufficiently stable to be reused for multiple subsequent steps.
    Invoked to justify the periodic refresh strategy.
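
How the two free parameters trade off can be illustrated with a back-of-envelope cost model. It rests on assumptions that this page does not confirm: that every refresh step (including step 0) costs as much as dense attention, that every other step costs a `keep_ratio` fraction of dense attention, and that refreshes fall on a uniform interval. The function name `relative_attention_cost` and the numbers in the example are hypothetical, and the model covers attention cost only, so it is not a prediction of the reported end-to-end speedup.

```python
import math


def relative_attention_cost(num_steps, keep_ratio, refresh_interval):
    """Attention cost relative to running dense attention at every step.

    Assumptions (not from the paper): steps 0, refresh_interval,
    2 * refresh_interval, ... run dense attention to recompute the pattern;
    all other steps cost keep_ratio times a dense step.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if refresh_interval < 1:
        raise ValueError("refresh_interval must be at least 1")
    dense_steps = math.ceil(num_steps / refresh_interval)
    sparse_steps = num_steps - dense_steps
    return (dense_steps + sparse_steps * keep_ratio) / num_steps


# Hypothetical setting: 64 steps, 25% of columns kept, refresh every 16 steps.
print(relative_attention_cost(64, 0.25, 16))
# prints 0.296875
```

Under these assumptions, shortening the refresh interval raises cost roughly linearly in the number of dense steps, while lowering `keep_ratio` only helps on the steps in between. That is why the two parameters need to be tuned together.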

