pith. machine review for the scientific record.

arxiv: 2604.17834 · v1 · submitted 2026-04-20 · 💻 cs.DC

Recognition: unknown

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:31 UTC · model grok-4.3

classification 💻 cs.DC
keywords SpMM · GPU acceleration · asynchronous execution · TMA · warp specialization · BCSR · WCSR · sparse linear algebra

The pith

Asynchronous GPU features power new SpMM kernels that beat existing libraries by over 6x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two kernels that use modern GPU asynchronous hardware to speed up sparse matrix-matrix multiplication. For block-structured sparsity, BCSR organizes data so a warp-specialized pipeline can overlap TMA transfers with WGMMA computation. For irregular cases, WCSR loads the sparse operand via TMA and splits large row windows across thread blocks to balance work. These changes deliver 1.47x gains over the prior best kernel and 6.24x over cuSPARSE on SuiteSparse benchmarks, plus a 2.66x end-to-end improvement in large-model inference. Faster SpMM directly shortens runtimes for many scientific and machine-learning workloads that rely on sparse operations.
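
For readers who want the data layout made concrete, here is a minimal NumPy sketch of a BCSR-style SpMM (sparse A times dense B). It is illustrative only: the block size, array names, and loop structure are assumptions for exposition, not the paper's kernel, which executes the per-block multiplies on tensor cores with TMA-staged operands.

```python
# Illustrative only: a NumPy model of BCSR-style SpMM (sparse A times dense B),
# showing why block structure maps cleanly onto dense tile multiplies.
import numpy as np

def bcsr_spmm(block_rows, block_cols, blocks, B, bs):
    """block_rows/block_cols index the nonzero bs x bs blocks of A (CSR over
    blocks); each dense block multiplies a bs-row slab of B, which is exactly
    the shape a tensor-core (WGMMA-style) tile instruction consumes."""
    n_block_rows = len(block_rows) - 1
    C = np.zeros((n_block_rows * bs, B.shape[1]), dtype=B.dtype)
    for i in range(n_block_rows):
        for k in range(block_rows[i], block_rows[i + 1]):
            j = block_cols[k]
            # dense tile x dense slab: the per-block "MMA" step
            C[i*bs:(i+1)*bs] += blocks[k] @ B[j*bs:(j+1)*bs]
    return C

# Tiny example: a 4x4 matrix with two nonzero 2x2 blocks.
bs = 2
blocks = np.array([[[1., 2.], [3., 4.]], [[5., 6.], [7., 8.]]])
block_rows = np.array([0, 1, 2])   # one block in each block row
block_cols = np.array([0, 1])      # their block-column indices
B = np.arange(8, dtype=float).reshape(4, 2)
print(bcsr_spmm(block_rows, block_cols, blocks, B, bs))
```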

Core claim

The authors establish that co-designed BCSR and WCSR kernels, which exploit TMA asynchronous data movement and warp specialization to hide latency, outperform prior SpMM implementations. WCSR achieves 1.47x over AccSpMM and 6.24x over cuSPARSE across SuiteSparse matrices, while BCSR yields a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90 percent block sparsity with 64K tokens relative to cuDNN and cuBLAS.
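
One way to read the 2.66x end-to-end figure: the gain depends on both the kernel-level speedup and the fraction of prefill time spent in the affected matmuls. The sketch below is plain Amdahl's-law arithmetic with made-up numbers, not the paper's measured breakdown.

```python
# Illustrative arithmetic only (not the paper's measured breakdown): how a
# kernel-level SpMM speedup turns into an end-to-end prefill speedup.
def end_to_end_speedup(spmm_fraction, spmm_speedup):
    return 1.0 / ((1.0 - spmm_fraction) + spmm_fraction / spmm_speedup)

# e.g. a hypothetical workload spending 80% of prefill time in the sparse
# matmuls would need roughly a 4.5x kernel-level gain over the dense baseline
# to land near 2.65x end to end.
print(end_to_end_speedup(0.80, 4.5))   # ~2.65
```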

What carries the argument

Warp-specialized producer-consumer pipeline that overlaps TMA data transfers with WGMMA computation, using BCSR format for structured sparsity and WCSR format for windowed irregular sparsity with cross-block row splitting.
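
The overlap argument can be illustrated without GPU code. The following Python sketch is only an analogy: threads and a bounded queue stand in for the TMA (producer) warp and the WGMMA (compute) warpgroup, and the sleeps stand in for memory and math latency; the real pipeline uses shared-memory buffers and barrier hand-off, and all timings here are made up.

```python
# Schematic analogy of a producer-consumer pipeline: total time approaches
# max(load, compute) per tile instead of their sum once the two overlap.
import threading, queue, time

N_TILES = 8
LOAD_S, COMPUTE_S = 0.010, 0.012      # pretend per-tile latencies (seconds)
buf = queue.Queue(maxsize=2)          # stands in for a double buffer in shared memory

def producer():                       # "TMA warp": issues the async copies
    for t in range(N_TILES):
        time.sleep(LOAD_S)            # pretend the tile is arriving from HBM
        buf.put(t)
    buf.put(None)                     # completion signal

def consumer():                       # "compute warpgroup": runs the MMAs
    while buf.get() is not None:
        time.sleep(COMPUTE_S)         # pretend WGMMA on the staged tile

start = time.perf_counter()
p = threading.Thread(target=producer)
p.start()
consumer()
p.join()
overlapped = time.perf_counter() - start
serial = N_TILES * (LOAD_S + COMPUTE_S)
print(f"overlapped: {overlapped*1e3:.0f} ms   naive serial: {serial*1e3:.0f} ms")
```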

If this is right

  • Block-sparse LLM inference at high sparsity can run more than twice as fast end-to-end compared with dense library baselines.
  • Irregular sparse matrices from standard collections can be multiplied 1.47 times faster than the previous best kernel and over six times faster than cuSPARSE.
  • Asynchronous memory features become a primary optimization target for sparse linear-algebra kernels on current GPUs.
  • Load balancing via window splitting across thread blocks improves throughput on non-uniform sparse data (see the sketch after this list).
  • New sparse storage formats aligned with hardware async primitives enable better overlap of data movement and computation.
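
To make the window-splitting bullet concrete, here is a hedged sketch of one way such a partition could work. The window granularity and the nonzero budget are illustrative assumptions, not the WCSR kernel's actual parameters.

```python
# Minimal sketch of the load-balancing idea: windows whose nonzero count
# exceeds a budget are split into several work items so no single thread block
# owns a huge row window. The budget heuristic is illustrative only.
import numpy as np

def split_windows(nnz_per_window, budget):
    """Return (window_id, start_nnz, end_nnz) work items of at most `budget`
    nonzeros each; heavy windows become several items, light windows stay whole."""
    items = []
    for w, nnz in enumerate(nnz_per_window):
        for start in range(0, int(nnz), budget):
            items.append((w, start, min(start + budget, int(nnz))))
    return items

nnz_per_window = np.array([120, 7, 3500, 64])   # skewed, e.g. a power-law graph
for item in split_windows(nnz_per_window, budget=512):
    print(item)
# The 3500-nonzero window becomes 7 items, so work per block stays near the budget.
```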

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar producer-consumer pipelines could accelerate other sparse kernels such as SpMV or sparse convolutions on the same hardware.
  • Adoption of these kernels might lower overall energy use in data centers that run sparse machine-learning workloads.
  • Future GPU designs may benefit from exposing even more flexible asynchronous primitives once their value for sparse codes is shown.
  • Real-world models may increasingly adopt block or window sparsity patterns to exploit these performance gains.

Load-bearing premise

That practical sparsity patterns will match the block-structured or windowed irregular forms assumed by BCSR and WCSR, and that the asynchronous overlap adds negligible overhead on target hardware.

What would settle it

Measuring performance on a collection of matrices or model weights whose nonzero pattern is uniformly random at the same density, with no block or window structure, to determine whether the reported speedups remain or reverse.
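
A rough version of the setup half of that experiment can be sketched directly: generate a uniformly random pattern at a fixed density and measure how much padding a block format would have to carry. The block size and density below are arbitrary choices, not values from the paper.

```python
# Hedged sketch of the proposed falsification test's input: an unstructured,
# uniformly random sparsity pattern with no block or window structure.
import numpy as np

rng = np.random.default_rng(0)
n, density, bs = 4096, 0.01, 16
mask = rng.random((n, n)) < density                     # random pattern at 1% density

blocks = mask.reshape(n // bs, bs, n // bs, bs).swapaxes(1, 2)
nonzero_blocks = blocks.any(axis=(2, 3))
stored = nonzero_blocks.sum() * bs * bs                 # values a BCSR-like format keeps
useful = mask.sum()
print(f"block occupancy: {nonzero_blocks.mean():.2%}")
print(f"stored zeros per useful nonzero: {stored / useful - 1:.1f}")
# At 1% random density almost every 16x16 block has some nonzero
# (1 - 0.99**256 is about 92%), so a block format carries roughly 90x padding;
# whether the reported speedups survive that regime is the open question.
```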

Figures

Figures reproduced from arXiv: 2604.17834 by Huanzhi Pu, Jie Liu, Zhiru Zhang.

Figure 1. An overview of the NVIDIA Hopper (SM90) GPU architecture. Each …
Figure 2. Comparison of BCSR and WCSR sparse formats for the same sparse …
Figure 3. Execution timeline. (a) Synchronous cooperative loads: all threads …
Figure 4. Warp-specialization pipeline for the BCSR kernel. (a) Thread block …
Figure 5. Distribution of normalized speedup over cuSPARSE BELL SpMM on SuiteSparse matrices for …
Figure 6. Performance breakdown showing the cumulative impact of each …
Figure 7. Warp-specialized kernel throughput variance with …
Figure 8. End-to-end prefill speedup over the dense baseline on Qwen2.5-7B …
Original abstract

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens over cuDNN/cuBLAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AsyncSparse, a set of SpMM kernels that exploit NVIDIA GPU asynchronous features (TMA transfers overlapped with WGMMA via warp specialization). For block-structured sparsity it uses a BCSR format with a producer-consumer pipeline; for irregular sparsity it uses a WCSR format that splits row windows across thread blocks. On SuiteSparse matrices the WCSR kernel is reported to deliver 1.47× over AccSpMM and 6.24× over cuSPARSE; the BCSR kernel yields a 2.66× end-to-end speedup on Qwen2.5-7B prefill at 90 % block sparsity with 64 K tokens.

Significance. If the performance claims are substantiated, the work provides concrete evidence that modern asynchronous GPU primitives can be profitably co-designed with sparse formats for both regular and irregular SpMM, which is relevant to large-scale ML inference and scientific computing workloads. The explicit use of TMA-WGMMA overlap and warp specialization is a timely contribution given the evolution of Hopper-and-later architectures.

major comments (3)
  1. [Evaluation / Kernel Design sections] The central claim attributes the reported speedups primarily to the exploitation of asynchronous TMA-WGMMA overlap and warp specialization. However, both kernels also introduce new compressed formats (BCSR and WCSR). No ablation is presented that holds the sparse format fixed while disabling the asynchronous components (e.g., replacing TMA with synchronous loads or removing producer-consumer pipelining). This omission makes it impossible to determine how much of the 1.47×/6.24× and 2.66× gains are due to the async features versus the format changes themselves.
  2. [Abstract and Results] The abstract and results sections state concrete speedups (1.47×, 6.24×, 2.66×) but provide no details on measurement methodology, number of runs, variance, hardware configuration, or full baseline library versions and compilation flags. Without these, the reproducibility and robustness of the performance numbers cannot be assessed. (A sketch of the kind of device-side timing protocol being requested appears after the minor comments.)
  3. [End-to-end evaluation] The BCSR kernel's 2.66× end-to-end claim is conditioned on 90 % block sparsity with 64 K tokens. The paper does not discuss how this sparsity level is obtained in practice for Qwen2.5-7B or whether the reported gains degrade under more realistic or lower sparsity patterns.
minor comments (2)
  1. [Introduction / Kernel Design] Notation for BCSR and WCSR should be introduced with a small diagram or pseudocode in the first section where they appear, rather than only in the kernel description.
  2. [Experimental setup] The paper should cite the exact cuSPARSE, cuDNN, and cuBLAS versions used for all baselines.
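
For reference, major comment 2 is asking for something like the following device-side timing protocol. This is a generic, hedged sketch assuming a PyTorch/CUDA environment; it is not the authors' harness, whose details (H100, CUDA 12.4, median of 10 runs) are only promised in the rebuttal below.

```python
# Generic device-side kernel timing: warm up, then report the median and
# spread of CUDA-event times across repeated runs.
import statistics
import torch

def time_kernel_ms(fn, warmup=5, iters=10):
    """Median and stdev (ms) of device-side execution time for fn()."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()                 # wait for the kernel to finish
        samples.append(start.elapsed_time(end))  # elapsed time in milliseconds
    return statistics.median(samples), statistics.stdev(samples)

# Example stand-in for the kernel under test: a dense fp16 matmul.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(time_kernel_ms(lambda: a @ b))
```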

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and commit to revisions that will strengthen the manuscript's claims and reproducibility.

read point-by-point responses
  1. Referee: [Evaluation / Kernel Design sections] The central claim attributes the reported speedups primarily to the exploitation of asynchronous TMA-WGMMA overlap and warp specialization. However, both kernels also introduce new compressed formats (BCSR and WCSR). No ablation is presented that holds the sparse format fixed while disabling the asynchronous components (e.g., replacing TMA with synchronous loads or removing producer-consumer pipelining). This omission makes it impossible to determine how much of the 1.47×/6.24× and 2.66× gains are due to the async features versus the format changes themselves.

    Authors: We agree that an ablation isolating the asynchronous primitives from the format changes would provide stronger evidence. The BCSR and WCSR formats were co-designed specifically to enable efficient TMA usage and warp-specialized pipelining. In the revision we will add an ablation study that keeps the sparse formats fixed while replacing TMA loads with synchronous equivalents and disabling the producer-consumer pipeline, allowing direct quantification of the async contribution on the same SuiteSparse and model workloads. revision: yes

  2. Referee: [Abstract and Results] The abstract and results sections state concrete speedups (1.47×, 6.24×, 2.66×) but provide no details on measurement methodology, number of runs, variance, hardware configuration, or full baseline library versions and compilation flags. Without these, the reproducibility and robustness of the performance numbers cannot be assessed.

    Authors: We acknowledge that the current manuscript lacks sufficient methodological detail. The revised version will include an expanded evaluation subsection that reports: the exact GPU (NVIDIA H100), CUDA 12.4, number of runs (median of 10), standard deviation across runs, precise library versions (cuSPARSE 12.4, cuDNN 9.0, cuBLAS), and compilation flags (-O3 with sm_90a). Kernel timings will be explicitly defined as device-side execution time. revision: yes

  3. Referee: [End-to-end evaluation] The BCSR kernel's 2.66× end-to-end claim is conditioned on 90 % block sparsity with 64 K tokens. The paper does not discuss how this sparsity level is obtained in practice for Qwen2.5-7B or whether the reported gains degrade under more realistic or lower sparsity patterns.

    Authors: The 90 % block sparsity is produced by applying block-wise structured pruning to the Qwen2.5-7B weights. We will revise the end-to-end section to describe the pruning procedure and add results at 50 % and 70 % block sparsity (same 64 K token prefill) to illustrate how speedups vary with sparsity level, thereby addressing robustness under more realistic patterns (a sketch of one possible block-wise pruning criterion follows these responses). revision: yes
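
As noted in response 3, the rebuttal commits to describing the pruning procedure but does not state the criterion here. The sketch below shows one simple block-wise rule (keep the 10 % of blocks with the largest L2 norm) purely for illustration; the block size and the norm-based criterion are assumptions, not the authors' method.

```python
# Illustrative block-wise structured pruning to ~90% block sparsity.
import numpy as np

def block_prune(W, bs=64, keep=0.10):
    """Zero out all but the top-`keep` fraction of bs x bs blocks by L2 norm."""
    r, c = W.shape
    blocks = W.reshape(r // bs, bs, c // bs, bs).swapaxes(1, 2)
    norms = np.linalg.norm(blocks, axis=(2, 3))          # Frobenius norm per block
    k = max(1, int(round(keep * norms.size)))
    thresh = np.partition(norms.ravel(), -k)[-k]
    mask = norms >= thresh                               # blocks that survive
    pruned = blocks * mask[:, :, None, None]
    return pruned.swapaxes(1, 2).reshape(r, c), mask

W = np.random.randn(1024, 1024).astype(np.float32)
Wp, mask = block_prune(W, bs=64, keep=0.10)
print(f"block sparsity: {1 - mask.mean():.0%}")          # ~90%
```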

Circularity Check

0 steps flagged

No circularity: the empirical kernel benchmarks rest on comparisons against external baselines

full rationale

The paper reports measured speedups from two new SpMM kernels (BCSR for structured block sparsity and WCSR for irregular cases) that exploit TMA/WGMMA asynchronous overlap on modern GPUs. All central claims are direct empirical comparisons to external libraries (cuSPARSE, cuDNN, AccSpMM) on SuiteSparse matrices and an end-to-end LLM prefill workload. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs or self-citations. The derivation chain consists of implementation choices and benchmarking, which remain falsifiable against independent hardware runs and do not invoke load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems/engineering paper focused on kernel implementation. No mathematical free parameters, domain axioms, or invented entities are introduced beyond standard assumptions about GPU memory hierarchy and sparsity patterns.

pith-pipeline@v0.9.0 · 5501 in / 1179 out tokens · 45515 ms · 2026-05-10T04:31:40.370953+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 30 canonical work pages · 5 internal anchors

  1. [1]

    SIAM, 2 edition, 2003

Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Society for Industrial and Applied Mathematics, 2003. [Online]. Available: https://doi.org/10.1137/1.9780898718003

  2. [2]

    An overview of the sparse basic linear algebra subprograms: The new standard from the blas technical forum,

    I. S. Duff, M. A. Heroux, and R. Pozo, “An overview of the sparse basic linear algebra subprograms: The new standard from the blas technical forum,”ACM Trans. Math. Softw., vol. 28, no. 2, p. 239–267, Jun 2002. [Online]. Available: https://doi.org/10.1145/567806.567810

  3. [3]

    Mathematical foundations of the graphblas,

    J. Kepner, P. Aaltonen, D. A. Bader, A. Buluç, F. Franchetti, J. R. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, J. E. Moreira, J. D. Owens, C. Yang, M. Zalewski, and T. G. Mattson, “Mathematical foundations of the graphblas,”CoRR, vol. abs/1606.05790, 2016. [Online]. Available: http://arxiv.org/abs/ 1606.05790

  4. [4]

    Gunrock: a high-performance graph processing library on the gpu,

    Y . Wang, A. Davidson, Y . Pan, Y . Wu, A. Riffel, and J. D. Owens, “Gunrock: a high-performance graph processing library on the gpu,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’16. New York, NY , USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.11...

  5. [5]

    Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,

    T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, “Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,”Journal of Machine Learning Research (JMLR), vol. 22, no. 1, pp. 241:1–241:124, 2021. [Online]. Available: http://jmlr.org/papers/v22/21-0366.html

  6. [6]

Sparse gpu kernels for deep learning,

    T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’20. IEEE Press, 2020. [Online]. Available: https://doi.org/10.1109/SC41405.2020.00021

  7. [7]

    Deep graph library: A graph-centric, highly-performant package for graph neural networks,

    M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y . Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, “Deep graph library: A graph-centric, highly-performant package for graph neural networks,” 2020. [Online]. Available: https://arxiv.org/abs/1909.01315

  8. [8]

    Block-sparse recurrent neural networks,

S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” 2017. [Online]. Available: https://arxiv.org/abs/1711.02782

  9. [9]

    Gpu kernels for block-sparse weights,

S. Gray, A. Radford, and D. P. Kingma, “Gpu kernels for block-sparse weights,” 2017. [Online]. Available: https://cdn.openai.com/blocksparse/blocksparsepaper.pdf

  10. [10]

    Enabling unstructured sparse acceleration on structured sparse accelerators,

G. Jeong, P.-A. Tsai, A. R. Bambhaniya, S. W. Keckler, and T. Krishna, “Enabling unstructured sparse acceleration on structured sparse accelerators,” in Eighth Conference on Machine Learning and Systems, 2025. [Online]. Available: https://openreview.net/forum?id=Py0XA6QQAh

  11. [11]

    Sparsegpt: Massive language models can be accurately pruned in one-shot,

    E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” inProceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 10 323–10 337. [Online]. Available: https://proceedings.mlr.press/v202/frantar23a.html

  12. [12]

    A simple and effective pruning approach for large language models,

    M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=PxoFut3dWW

  13. [13]

    TC-GNN: Bridging sparse GNN computation and dense tensor cores on GPUs,

    Y . Wang, B. Feng, Z. Wang, G. Huang, and Y . Ding, “TC-GNN: Bridging sparse GNN computation and dense tensor cores on GPUs,” in 2023 USENIX Annual Technical Conference (USENIX ATC 23). Boston, MA: USENIX Association, Jul 2023, pp. 149–164. [Online]. Available: https://www.usenix.org/conference/atc23/presentation/wang-yuke

  14. [14]

Flashsparse: Minimizing computation redundancy for fast sparse matrix multiplications on tensor cores,

    J. Shi, S. Li, Y . Xu, R. Fu, X. Wang, and T. Wu, “Flashsparse: Minimizing computation redundancy for fast sparse matrix multiplications on tensor cores,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 312–325. [Onl...

  15. [15]

Acc-spmm: Accelerating general-purpose sparse matrix-matrix multiplication with gpu tensor cores,

    H. Zhao, S. Li, J. Wang, C. Zhou, J. Wang, Z. Xin, S. Li, Z. Liang, Z. Pan, F. Liu, Y . Zeng, Y . Wang, and X. Chi, “Acc-spmm: Accelerating general-purpose sparse matrix-matrix multiplication with gpu tensor cores,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY , USA...

  16. [16]

    Dissecting the nvidia hopper architecture through microbenchmarking and multiple level analysis,

W. Luo, R. Fan, Z. Li, D. Du, H. Liu, Q. Wang, and X. Chu, “Dissecting the nvidia hopper architecture through microbenchmarking and multiple level analysis,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12084

  17. [17]

    Cusparse library,

    M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “Cusparse library,” inGPU technology conference, vol. 12, 2010

  18. [18]

    TorchAO: Pytorch-native training-to-serving model optimization,

A. Or, A. Jain, D. Vega-Myhre, J. Cai, C. D. Hernandez, Z. Zhang, D. Guessous, V. Kuznetsov, C. Puhrsch, M. Saroufim, and S. Rao, “TorchAO: Pytorch-native training-to-serving model optimization,” in Championing Open-source DEvelopment in ML Workshop @ ICML25, 2025. [Online]. Available: https://openreview.net/forum?id=HpqH0JakHf

  19. [19]

    Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity,

    H. Xia, Z. Zheng, Y . Li, D. Zhuang, Z. Zhou, X. Qiu, Y . Li, W. Lin, and S. L. Song, “Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity,”Proc. VLDB Endow., vol. 17, no. 2, p. 211–224, Oct 2023. [Online]. Available: https://doi.org/10.14778/3626292.3626303

  20. [20]

    Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,

    R. Fan, W. Wang, and X. Chu, “Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 253–2...

  21. [21]

    Outperforming cuBLAS on H100: A worklog,

P. Shankhdhar, “Outperforming cuBLAS on H100: A worklog,” https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog, 2024

  22. [22]

Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,

    G. Huang, G. Dai, Y . Wang, and H. Yang, “Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–12. [Online]. Available: https://doi.org/10.1109/SC41405.2020.00076

  23. [23]

    Fastspmm: Leveraging tensor cores for sparse matrix multiplication,

    H. Wang, M. Li, W. Jia, H. Yang, and G. Tan, “Fastspmm: Leveraging tensor cores for sparse matrix multiplication,” inProceedings of the 22nd ACM International Conference on Computing Frontiers, ser. CF ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 195–204. [Online]. Available: https://doi.org/10.1145/3719276.3725173

  24. [24]

    Accelerating gnns on gpu sparse tensor cores through n:m sparsity-oriented graph reordering,

    J.-A. Chen, H.-H. Sung, R. Zhang, A. Li, and X. Shen, “Accelerating gnns on gpu sparse tensor cores through n:m sparsity-oriented graph reordering,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 16–28. [Online]. Av...

  25. [25]

    Bridging the gap between unstructured spmm and structured sparse tensor cores,

    Y . Dong, Z. Shen, W. Jiang, Z. Liu, Y . Xu, B. He, R. Zheng, and H. Jin, “Bridging the gap between unstructured spmm and structured sparse tensor cores,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 645–660. [O...

  26. [26]

    Jigsaw: Accelerating spmm with vector sparsity on sparse tensor core,

    K. Zhang, X. Liu, H. Yang, T. Feng, X. Yang, Y . Liu, Z. Luan, and D. Qian, “Jigsaw: Accelerating spmm with vector sparsity on sparse tensor core,” inProceedings of the 53rd International Conference on Parallel Processing, ser. ICPP ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 1124–1134. [Online]. Available: https://doi.org/10.11...

  27. [27]

    The tensor algebra compiler,

F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe, “The tensor algebra compiler,” Proc. ACM Program. Lang., vol. 1, no. OOPSLA, Oct 2017. [Online]. Available: https://doi.org/10.1145/3133901

  28. [28]

    Sparsetir: Composable abstractions for sparse compilation in deep learning,

    Z. Ye, R. Lai, J. Shao, T. Chen, and L. Ceze, “Sparsetir: Composable abstractions for sparse compilation in deep learning,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023. New York, NY , USA: Association for Computing Machinery, 2023, p. 660–678. [...

  29. [29]

    Unisparse: An intermediate language for general sparse format customization,

    J. Liu, Z. Zhao, Z. Ding, B. Brock, H. Rong, and Z. Zhang, “Unisparse: An intermediate language for general sparse format customization,” Proc. ACM Program. Lang., vol. 8, no. OOPSLA1, Apr 2024. [Online]. Available: https://doi.org/10.1145/3649816

  30. [30]

    Splat: A framework for optimised gpu code-generation for sparse regular attention,

    A. Gupta, Y . Yuan, D. Jain, Y . Ge, D. Aponte, Y . Zhou, and C. Mendis, “Splat: A framework for optimised gpu code-generation for sparse regular attention,”Proc. ACM Program. Lang., vol. 9, no. OOPSLA1, Apr 2025. [Online]. Available: https://doi.org/10.1145/3720503

  31. [31]

Triton: an intermediate language and compiler for tiled neural network computations,

    P. Tillet, H. T. Kung, and D. Cox, “Triton: an intermediate language and compiler for tiled neural network computations,” inProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, ser. MAPL 2019. New York, NY , USA: Association for Computing Machinery, 2019, p. 10–19. [Online]. Available: https://doi.org/10...

  32. [32]

    Tawa: Automatic warp specialization for modern gpus with asynchronous references,

    H. Chen, B. Fan, A. Collins, B. Hagedorn, E. Gaburov, M. Masuda, M. Brookhart, C. Sullivan, J. Knight, Z. Zhang, and V . Grover, “Tawa: Automatic warp specialization for modern gpus with asynchronous references,” 2025. [Online]. Available: https://arxiv.org/abs/2510.14719

  33. [33]

    Optimal brain damage,

Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in NIPS, 1989, pp. 598–605. [Online]. Available: http://papers.nips.cc/paper/250-optimal-brain-damage

  34. [34]

    Optimal brain surgeon: Extensions and performance comparison,

B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon: Extensions and performance comparison,” in NIPS, 1993, pp. 263–270. [Online]. Available: http://papers.nips.cc/paper/749-optimal-brain-surgeon-extensions-and-performance-comparisons

  35. [35]

    Learning both weights and connections for efficient neural network,

S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural network,” in NIPS, 2015, pp. 1135–1143. [Online]. Available: http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network

  36. [36]

Chasing sparsity in vision transformers: An end-to-end exploration,

T. Chen, Y. Cheng, Z. Gan, L. Yuan, L. Zhang, and Z. Wang, “Chasing sparsity in vision transformers: An end-to-end exploration,” in NeurIPS, 2021, pp. 19974–19988. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/a61f27ab2165df0e18cc9433bd7f27c5-Abstract.html

  37. [37]

    SliceGPT: Compress large language models by deleting rows and columns,

    S. Ashkboos, M. L. Croci, M. G. do Nascimento, T. Hoefler, and J. Hensman, “SliceGPT: Compress large language models by deleting rows and columns,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=vXxardq6db

  38. [38]

    The sparse frontier: Sparse attention trade-offs in transformer llms,

    P. Nawrot, R. Li, R. Huang, S. Ruder, K. Marchisio, and E. M. Ponti, “The sparse frontier: Sparse attention trade-offs in transformer llms,”

  39. [39]

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

    [Online]. Available: https://arxiv.org/abs/2504.17768

  40. [40]

    Efficient Streaming Language Models with Attention Sinks

    G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” 2024. [Online]. Available: https://arxiv.org/abs/2309.17453

  41. [41]

    Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” 2020. [Online]. Available: https://arxiv.org/abs/2004.05150

  42. [42]

Xattention: Block sparse attention with antidiagonal scoring,

    R. Xu, G. Xiao, H. Huang, J. Guo, and S. Han, “Xattention: Block sparse attention with antidiagonal scoring,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16428

  43. [43]

    An efficient training algorithm for models with block-wise sparsity,

    D. Zhu, Z. Zuo, and M. M. Khalili, “An efficient training algorithm for models with block-wise sparsity,” 2025. [Online]. Available: https://arxiv.org/abs/2503.21928

  44. [44]

    Thanos: A block-wise pruning algorithm for efficient large language model compression,

    I. Ilin and P. Richtarik, “Thanos: A block-wise pruning algorithm for efficient large language model compression,” 2025. [Online]. Available: https://arxiv.org/abs/2504.05346

  45. [45]

    MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention,

    H. Jiang, Y . Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y . Lin, Y . Yang, and L. Qiu, “MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=fPBACAbqSN

  46. [46]

    Model-based memory hierarchy optimizations for sparse matrices,

E.-J. Im, “Model-based memory hierarchy optimizations for sparse matrices,” 2007. [Online]. Available: https://api.semanticscholar.org/CorpusID:14967653

  47. [47]

Reducing the bandwidth of sparse symmetric matrices,

    E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” inProceedings of the 1969 24th National Conference, ser. ACM ’69. New York, NY , USA: Association for Computing Machinery, 1969, p. 157–172. [Online]. Available: https://doi.org/10.1145/800195.805928

  48. [48]

The university of florida sparse matrix collection,

    T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,”ACM Trans. Math. Softw., vol. 38, no. 1, Dec 2011. [Online]. Available: https://doi.org/10.1145/2049662.2049663

  49. [49]

    Qwen2.5 technical report,

    Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

  50. [50]

    Qwen2.5 Technical Report

    [Online]. Available: https://arxiv.org/abs/2412.15115