PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

Futing Sun; Letian Chen; Liqiang Nie; Miao Zhang; Weili Guan; Yanyi Lyu

arxiv: 2605.20813 · v1 · pith:MA7NVGGNnew · submitted 2026-05-20 · 💻 cs.CL

Yanyi Lyu, Letian Chen, Futing Sun, Miao Zhang, Weili Guan, Liqiang Nie

Pith reviewed 2026-05-21 05:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords diffusion language models · sparse attention · column-sparse · inference speedup · periodic refresh · self-attention optimization · GPU kernels


The pith

PulseCol achieves up to 1.95× end-to-end speedup over FlashAttention for diffusion language models using periodically refreshed column-sparse attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to show that replacing full self-attention with a column-sparse version refreshed only at select steps allows diffusion language models to run inference much faster. By finding the important columns early and reusing the pattern with occasional updates, it achieves more sparsity than block-based methods that start later. If this holds, it means the expensive repeated attention calculations during denoising become far less costly, making longer context generation more feasible on current hardware. Custom kernels make the column sparsity practical on GPUs.

Core claim

The central discovery is that column-sparse attention patterns identified at the first denoising step can be reused across most iterations, with refreshes only at a few intermediate points, yielding higher sparsity, maintained quality, and up to 1.95× end-to-end speedup over FlashAttention across context lengths.

What carries the argument

Periodically refreshed column-sparse attention that selects important columns for computation and updates the selection infrequently to follow pattern changes during denoising.
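
The select, reuse, and refresh loop described above can be sketched in NumPy for a single attention head. This is a minimal illustration under stated assumptions, not the authors' implementation: the selection criterion (total attention mass received per key column) and the names `select_columns`, `column_sparse_attention`, `denoise_attention`, `keep_ratio`, and `refresh_steps` are all hypothetical, since this page does not specify how PulseCol scores columns.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_columns(q, k, keep_ratio):
    # ASSUMED criterion: keep the key columns that receive the most
    # attention mass summed over all queries. Computing this requires
    # one dense score matrix, so a refresh step is as costly as full attention.
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mass = scores.sum(axis=0)
    n_keep = max(1, int(round(keep_ratio * k.shape[0])))
    return np.sort(np.argsort(-mass)[:n_keep])

def column_sparse_attention(q, k, v, cols):
    # Every query attends only to the selected key/value columns.
    logits = q @ k[cols].T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v[cols]

def denoise_attention(qkv_per_step, keep_ratio, refresh_steps):
    # Compute the column set at the first step, reuse it afterwards,
    # and recompute it only at the steps listed in refresh_steps.
    cols, outputs = None, []
    for step, (q, k, v) in enumerate(qkv_per_step):
        if cols is None or step in refresh_steps:
            cols = select_columns(q, k, keep_ratio)
        outputs.append(column_sparse_attention(q, k, v, cols))
    return outputs
```

With `keep_ratio=1.0` every column is kept and the sketch reduces to dense attention, which gives a simple sanity check. A production version would run per head and per layer and would replace the gather in `column_sparse_attention` with a fused kernel.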

Load-bearing premise

Sparse patterns from early denoising steps stay valid enough for reuse with only occasional refreshes and do not degrade the final generated text quality.

What would settle it

Measure whether PulseCol outputs match full-attention outputs on standard quality metrics, such as perplexity or coherence scores, when both use the same number of denoising steps.

Figures

Figures reproduced from arXiv: 2605.20813 by Futing Sun, Letian Chen, Liqiang Nie, Miao Zhang, Weili Guan, Yanyi Lyu.

**Figure 2.** Early denoising attention in LLaDA-1.5 exhibits column sparsity. We visualize representa…

**Figure 3.** Overview of PulseCol. By constructing, refreshing, and reusing column-sparse indices, …

**Figure 4.** Workflow of the column-sparse attention kernel. Each query block attends only to indexed key-value tiles, updates online softmax statistics in SRAM, and writes the normalized output back to HBM without materializing the full attention matrix. The kernel schedules computation at the granularity of query blocks. For each query block i, it loads Q_i from HBM into SRAM, initializes the row-wise online softmax…

**Figure 5.** Speedup of our column-sparse kernel over FlashAttention under different sparsity levels and context lengths. Latency: We evaluate the standalone efficiency of the column-sparse attention kernel. Unless otherwise stated, all latency measurements are conducted on an NVIDIA RTX PRO 6000 GPU…

**Figure 6.** Hyperparameter analysis on HumanEval. We study the effects of query group size, sparse…

**Figure 7.** Additional attention visualizations in LLaDA-1.5. Each layer shows one representative…

**Figure 8.** Kernel speedup over FlashAttention with different query group sizes. The two panels…
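
The kernel workflow described in the Figure 4 caption (one query block, a list of indexed key-value tiles, and running softmax statistics) can be mimicked on the CPU. The following NumPy sketch only illustrates the online softmax recurrence; it is not the paper's GPU kernel, and the function name `tiled_sparse_attention` and the parameters `tile_ids` and `tile_size` are assumptions introduced for illustration. The arrays `m`, `l`, and `acc` play the role of the statistics that the caption says are kept in SRAM.

```python
import numpy as np

def tiled_sparse_attention(q_block, k, v, tile_ids, tile_size):
    # Attend one query block to the listed key/value tiles only.
    # Assumes tile_ids contains at least one tile.
    d = q_block.shape[-1]
    rows = q_block.shape[0]
    m = np.full(rows, -np.inf)            # running row-wise maximum
    l = np.zeros(rows)                    # running row-wise normalizer
    acc = np.zeros((rows, v.shape[-1]))   # unnormalized output accumulator
    for t in tile_ids:
        ks = k[t * tile_size:(t + 1) * tile_size]
        vs = v[t * tile_size:(t + 1) * tile_size]
        s = q_block @ ks.T / np.sqrt(d)   # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)         # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ vs
        m = m_new
    return acc / l[:, None]               # normalize once at the end
```

The result equals an ordinary softmax taken over the gathered columns, yet no score matrix wider than one tile is ever held at once, which is the property the caption highlights.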

Original abstract

Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95$\times$ end-to-end speedup over FlashAttention across several context lengths.
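
As a rough way to relate sparsity and refresh frequency to speedup, the attention cost can be modeled as follows. This is a back-of-the-envelope sketch, not a formula from the paper: the symbols $N$ (context length), $d$ (head dimension), $T$ (denoising steps), $R$ (steps at which the pattern is computed, including the first), and $\rho$ (fraction of columns retained) are introduced here, and the model assumes that a pattern-computing step costs about as much as one dense attention pass.

```latex
C_{\text{dense}} \propto T \, N^2 d, \qquad
C_{\text{sparse}} \propto R \, N^2 d + (T - R)\, \rho \, N^2 d, \qquad
\frac{C_{\text{dense}}}{C_{\text{sparse}}} = \frac{T}{R + (T - R)\,\rho}
```

Under these assumptions the attention-only speedup approaches $1/\rho$ when $R \ll T$. The end-to-end speedup is lower, because the non-attention layers are unaffected by the sparsification.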

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PulseCol's column-sparse attention with periodic refreshes delivers reported speedups for dLLMs, but the reuse of early patterns lacks supporting ablations or drift measurements.


The main point is that PulseCol applies column-level sparsity to diffusion LLM attention, identifies the pattern early in denoising, and reuses it with only a few refreshes to cut computation. It reports up to 1.95x end-to-end speedup over FlashAttention while keeping output quality comparable. This is a direct engineering step beyond the block-sparse approaches mentioned in the abstract: the finer granularity lets the authors retain important interactions more precisely and apply sparsity earlier than methods limited to later steps. The GPU kernels for column-sparse attention are presented as the enabler for the practical gains. Experiments across context lengths show the sparsity and speed improvements against standard baselines. The work targets a clear bottleneck in dLLM inference, where full attention runs repeatedly without KV caching.

The soft spot sits in the reuse assumption. The method computes the column mask at the first step and refreshes it only at selected later points, yet the abstract gives no numbers on how much the important columns actually shift between refreshes, nor any ablation on the refresh interval. If the attention focus moves noticeably during denoising, the accumulated approximation error could affect coherence even if average metrics look fine. That link between the algorithmic choice and the quality claim is the least anchored part of the story.

Readers working on efficient inference for diffusion models would find the sparsity formulation and kernel details useful. The paper does not propose a new capability; it offers a targeted optimization that could matter for deployment. It is worth sending for peer review so the experimental details and any additional validation of pattern stability can be checked properly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PulseCol, a periodically refreshed column-sparse attention method for accelerating inference in diffusion large language models (dLLMs). It replaces prior block-sparse approximations (applied only in later denoising iterations) with finer-grained column sparsity, computes sparse masks at the initial denoising step, reuses them across subsequent steps, and refreshes only at a small number of intermediate points to track pattern evolution. Optimized GPU kernels for column-sparse attention are presented, with experiments claiming higher sparsity, up to 1.95× end-to-end speedup over FlashAttention across context lengths, and maintained model quality.

Significance. If the empirical results hold under scrutiny, this approach could provide a practical advance for efficient dLLM inference by enabling earlier and more precise sparsification through reusable column-level patterns rather than coarse blocks. The concrete design choices (column sparsity ratio and refresh interval as free parameters) and reported practical speedups via custom kernels are strengths. However, the central claim of quality preservation rests on the stability of early-identified patterns, which requires stronger empirical grounding to fully assess impact.

major comments (2)

[§3] §3 (Method): The description of computing column-sparse masks at the initial denoising step and reusing them until periodic refreshes does not include a quantitative measure or bound on attention pattern drift across the denoising trajectory. This is load-bearing for the speedup and quality claims, as any material shift in important columns between refreshes would accumulate approximation error.
[§4] §4 (Experiments): No ablation is reported on varying the refresh interval (one of the two free parameters) or on the number of refresh points, nor are quality metrics (e.g., perplexity, coherence scores) shown as a function of refresh frequency. Without this, it is difficult to verify that infrequent refreshes suffice to maintain output quality as asserted in the abstract.
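
One simple way to quantify the pattern drift raised in the first major comment is the overlap between the column set in use at a step and the set that would be selected if the pattern were recomputed at that step. The sketch below uses Jaccard overlap, which is an assumed choice of metric; the paper, as summarized here, reports no such measurement, and the names `column_overlap` and `drift_curve` are hypothetical.

```python
def column_overlap(cols_a, cols_b):
    # Jaccard overlap between two sets of selected column indices.
    a, b = set(map(int, cols_a)), set(map(int, cols_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drift_curve(cols_per_step, refresh_steps):
    # cols_per_step[t] is the column set a fresh selection would pick at step t.
    # The set actually in use changes only at step 0 and at refresh steps.
    in_use, curve = None, []
    for step, cols in enumerate(cols_per_step):
        if in_use is None or step in refresh_steps:
            in_use = cols
        curve.append(column_overlap(in_use, cols))
    return curve
```

A curve that stays near 1.0 between refreshes would support the reuse assumption, while a sharp decay would indicate that the refresh points are too sparse.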

minor comments (2)

The abstract states results 'across several context lengths' but the main text would benefit from an explicit table or figure listing the exact lengths tested and per-length speedup/quality numbers for reproducibility.
Notation for the column sparsity ratio and refresh schedule could be introduced earlier with a clear mathematical definition to aid readers following the algorithmic description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that will provide additional empirical support for the stability of the proposed sparsity patterns.


Referee: [§3] §3 (Method): The description of computing column-sparse masks at the initial denoising step and reusing them until periodic refreshes does not include a quantitative measure or bound on attention pattern drift across the denoising trajectory. This is load-bearing for the speedup and quality claims, as any material shift in important columns between refreshes would accumulate approximation error.

Authors: We agree that an explicit quantitative analysis of attention pattern drift would strengthen the justification for periodic reuse. The current manuscript relies on end-to-end quality preservation in experiments to imply that drift remains limited within the chosen refresh intervals, but does not report direct metrics such as column overlap or drift bounds. In the revised version we will add such measurements across the denoising trajectory to quantify stability and bound the approximation error.

revision: yes

Referee: [§4] §4 (Experiments): No ablation is reported on varying the refresh interval (one of the two free parameters) or on the number of refresh points, nor are quality metrics (e.g., perplexity, coherence scores) shown as a function of refresh frequency. Without this, it is difficult to verify that infrequent refreshes suffice to maintain output quality as asserted in the abstract.

Authors: We acknowledge that a sensitivity analysis on the refresh interval would better demonstrate robustness. The submitted manuscript reports results only for the specific refresh schedule used to obtain the claimed speedups and quality, without varying the interval or number of refresh points. In the revision we will include an ablation varying the refresh interval, reporting perplexity and other quality metrics as a function of refresh frequency to verify that infrequent refreshes suffice.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic proposal is self-contained


The paper defines PulseCol via explicit algorithmic choices: column-sparse attention replacing block sparsity, with sparse masks computed at the initial denoising step and reused until a small number of refresh points. These are concrete design decisions evaluated empirically against external baselines such as FlashAttention. No equations or claims reduce by construction to fitted inputs, self-citations, or prior ansatzes from the same authors. The central performance claims rest on experimental measurements rather than tautological re-derivations, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical observation that attention patterns evolve slowly enough to allow reuse with periodic updates. No new physical entities or mathematical axioms beyond standard transformer attention are introduced.

free parameters (2)

refresh interval
Number of denoising steps between pattern recomputations; chosen to balance accuracy and speed.
column sparsity ratio
Fraction of columns retained per attention head; tuned for quality-speed tradeoff.
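
The two free parameters listed above can be captured in a small configuration object. This is a hypothetical sketch: the class name `PulseColConfig` and its fields are invented for illustration, and the evenly spaced schedule follows the ledger's description of a fixed interval, whereas the abstract only says that refreshes happen at a small number of intermediate steps.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PulseColConfig:  # hypothetical name, not from the paper
    num_steps: int         # total denoising steps
    refresh_interval: int  # denoising steps between pattern recomputations
    keep_ratio: float      # fraction of columns retained per attention head

    def refresh_steps(self):
        # ASSUMED schedule: step 0, then every refresh_interval steps.
        return list(range(0, self.num_steps, self.refresh_interval))

    def columns_kept(self, context_len):
        # Number of key columns each head attends to at a given context length.
        return max(1, round(self.keep_ratio * context_len))
```

For example, `PulseColConfig(num_steps=64, refresh_interval=16, keep_ratio=0.25)` recomputes the pattern at steps 0, 16, 32, and 48 and keeps 1024 columns at a context length of 4096. These numbers are illustrative and are not settings reported by the paper.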

axioms (1)

domain assumption: Attention patterns in early denoising steps are sufficiently stable to be reused for multiple subsequent steps.
Invoked to justify the periodic refresh strategy.




[1] [1]

Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

work page 2021

[2] [2]

Accelerated sampling from masked diffusion models via entropy bounded unmasking.arXiv preprint arXiv:2505.24857, 2025

Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, and Brian Karrer. Accelerated sampling from masked diffusion models via entropy bounded unmasking.arXiv preprint arXiv:2505.24857, 2025

work page arXiv 2025

[3] [3]

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

dparallel: Learnable parallel decoding for dllms.arXiv preprint arXiv:2509.26488, 2025

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, and Xinchao Wang. dparallel: Learnable parallel decoding for dllms.arXiv preprint arXiv:2509.26488, 2025

work page arXiv 2025

[6] [6]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[7] [7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models.arXiv preprint arXiv:2410.17891, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

S., Seo, J.-s., Zhang, Z., and Gupta, U

Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S Abdelfattah, Jae-sun Seo, Zhiru Zhang, and Udit Gupta. Flashdlm: Accelerating diffusion language model inference via efficient kv caching and guided diffusion.arXiv preprint arXiv:2505.21467, 2025

work page arXiv 2025

[12] [12]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023

[13] [13]

Flex- prefill: A context-aware sparse attention mechanism for efficient long-sequence inference.arXiv preprint arXiv:2502.20766,

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference.arXiv preprint arXiv:2502.20766, 2025

work page arXiv 2025

[14] [14]

Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022

Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022

work page 2022

[15] [15]

Liu, J., Dong, X., Ye, Z., Mehta, R., Fu, Y ., Singh, V ., Kautz, J., Zhang, C., and Molchanov, P

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Linfeng Zhang. dllm-cache: Accelerating diffusion large language models with adaptive caching.arXiv preprint arXiv:2506.06295, 2025

work page arXiv 2025

[16] [16]

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

dkv-cache: The cache for diffusion language models.arXiv preprint arXiv:2505.15781, 2025

Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models.arXiv preprint arXiv:2505.15781, 2025. 10

work page arXiv 2025

[18] [18]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19]

Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.

[20]

Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, and Alessio Devoto. Attention sinks in diffusion language models. arXiv preprint arXiv:2510.15731, 2025.

[21]

Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.

[22]

Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, and Xipeng Qiu. Sparse-dLLM: Accelerating diffusion LLMs with dynamic cache eviction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33038–33046, 2026.

[23]

Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, and Zhijie Deng. Diffusion LLMs can do faster-than-AR inference via discrete diffusion forcing. arXiv preprint arXiv:2508.09192, 2025.

[24]

Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, and Xinchao Wang. SparseD: Sparse attention for diffusion language models. arXiv preprint arXiv:2509.24014, 2025.

[25]

Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding. arXiv preprint arXiv:2505.22618, 2025.

[26]

Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, et al. LoSA: Locality aware sparse attention for block-wise diffusion language models. arXiv preprint arXiv:2604.12056, 2026.

[27]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[28]

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.

[29]

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.

[30]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.

[31]

Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737, 2023.

[32]

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223, 2025.

A Additional Visualizations of Column-Sparse Patterns

Figure 7 provides additional attention visuali...