Pith · machine review for the scientific record

arxiv: 2604.19004 · v1 · submitted 2026-04-21 · 💻 cs.DC · cs.MS

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU

Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3

classification 💻 cs.DC cs.MS
keywords SpGEMM · GPU · HyperLogLog · sparse matrix multiplication · estimation · hybrid accumulator · symbolic pass · workload imbalance

The pith

Estimation replaces exact symbolic passes to speed up GPU SpGEMM computations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the symbolic pass in two-pass GPU SpGEMM, which determines output row sizes and accounts for roughly 28 percent of runtime on average, can be replaced by lightweight HyperLogLog estimators. This change, paired with dynamic workflow selection and a hybrid accumulator that mixes shared- and global-memory hashing, maintains result correctness while cutting overhead in irregular sparse computations. This matters because SpGEMM underpins key workloads in simulations, graph analytics, and machine learning, where memory irregularity currently limits GPU efficiency. The approach delivers measured speedups of 1.4x to 2.8x over leading implementations on A100 and H100 GPUs for both square and rectangular matrices.

Core claim

Our approach replaces the costly symbolic step with lightweight HyperLogLog estimators, combined with an analysis strategy that dynamically selects the optimal workflow and guides accumulator configuration. In addition, we introduce a hybrid accumulator design, including a novel hash-based accumulator that leverages both shared and global memory. This estimation-based SpGEMM workflow consistently outperforms leading GPU SpGEMM implementations across a wide range of both square and rectangular matrices.

What carries the argument

HyperLogLog cardinality estimators that approximate output row sizes to replace exact symbolic counting, driving dynamic workflow choice and configuration of a hybrid shared-plus-global hash accumulator.
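The mechanism can be sketched in a few lines. The following is a minimal, illustrative re-implementation of the idea — not the authors' GPU kernel — estimating the distinct-column count of one output row; the precision value, the hash function, and the row-of-column-lists matrix layout are all assumptions made for the sketch.

```python
import hashlib
import math

P = 10                              # precision bits (a free parameter; see the ledger below)
M = 1 << P                          # number of registers
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard HLL bias constant for large M

def _hash64(x: int) -> int:
    # 64-bit hash of an integer column index.
    return int.from_bytes(
        hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8).digest(), "little")

def hll_estimate(items) -> float:
    """Estimate the number of distinct values in `items`."""
    regs = [0] * M
    for x in items:
        h = _hash64(x)
        idx = h & (M - 1)               # low P bits select a register
        w = h >> P                      # remaining 64-P bits give the rank
        rank = (64 - P) - w.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    raw = ALPHA * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:        # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw

def estimate_row_nnz(a_row_cols, b_rows):
    # Output row i of C = A*B has one nonzero per distinct column index in the
    # union of the B-rows selected by A's row i -- the quantity the symbolic
    # pass computes exactly and the estimator approximates.
    return hll_estimate(c for k in a_row_cols for c in b_rows[k])
```

With P = 10 the relative standard error is about 1.04/√1024 ≈ 3.3%, which is the kind of slack the workflow selector and accumulator sizing must tolerate.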

Load-bearing premise

HyperLogLog estimation error stays small enough that the selected workflow and accumulator still produce correct results and do not introduce hidden overhead that cancels the reported gains for the tested matrix distributions.

What would settle it

Measure output correctness and total runtime when the method is applied to matrices engineered to produce large HyperLogLog errors, such as highly skewed row-density distributions, and check whether speedups vanish or errors appear.
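Such a stress test could be set up as follows — a sketch with illustrative sizes and a crude Zipf-like skew, not anything taken from the paper — computing the exact per-row symbolic counts that the estimator would have to match:

```python
import random

random.seed(0)
N = 1000

# Crude skew: row k of A gets about N/(k+1) nonzeros, so a handful of rows are
# dense while most hold a single entry -- the high-variance regime the
# falsification test targets.
a_rows = [random.sample(range(N), max(1, N // (k + 1))) for k in range(N)]
b_rows = [random.sample(range(N), 8) for _ in range(N)]

# Exact symbolic pass: nnz of output row i of C = A*B is the size of the union
# of column indices of the B-rows selected by A's row i.
exact = [len({c for k in a_rows[i] for c in b_rows[k]}) for i in range(N)]

skew = max(exact) / (sum(exact) / N)
print(f"max/mean output row size: {skew:.1f}")
```

Running any estimator over `a_rows`/`b_rows` and comparing its per-row predictions against `exact` would directly expose whether the reported speedups survive large estimation errors on skewed inputs.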

Figures

Figures reproduced from arXiv: 2604.19004 by Giulia Guidi, Yifan Li.

Figure 1: Computation of a single output row using Gustavson's algorithm.

Figure 2: Common GPU SpGEMM workflow.

Figure 3: The estimation of per-row nonzeros using HyperLogLog.

Figure 4: Overview of the Ocean SpGEMM workflow, with estimation-based components highlighted in darker backgrounds. The analysis step gathers metrics for workflow selection and load balancing. Then, size prediction uses either symbolic accumulation or estimation to predict per-row output sizes. Finally, numeric computation performs the multiplication using the collected statistics, followed by postprocessing…

Figure 5: Computation of Input Expansion Ratio (ER) and Output Compression Ratio (CR). ER denotes the ratio between the number of intermediate products and the number of nonzeros in input matrix A. CR denotes the ratio between the number of intermediate products and the number of nonzeros in output matrix C.

Figure 6: The smoothed GFLOPS achieved over the square…

Figure 7: Relative speedup of SpGEMM solutions compared to…

Figure 8: CDF of the average relative estimation error of HLL…

Figure 9: Runtime breakdown of individual components.
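Figure 5's two metrics can be made concrete with a short sketch. Under an assumed row-of-column-lists layout (not the paper's GPU data structures), the intermediate-product count, ER, and CR fall out directly:

```python
def workload_metrics(a_rows, b_rows):
    """Input expansion ratio ER and output compression ratio CR (cf. Figure 5)."""
    # Every nonzero A[i, k] spawns nnz(B row k) intermediate products.
    flops = sum(len(b_rows[k]) for cols in a_rows for k in cols)
    nnz_a = sum(len(cols) for cols in a_rows)
    # nnz(C): distinct output columns per row, summed over rows.
    nnz_c = sum(len({c for k in cols for c in b_rows[k]}) for cols in a_rows)
    return flops / nnz_a, flops / nnz_c

# Toy example: ER > 1 when B's rows fan out from A's nonzeros, CR > 1 when
# many intermediate products collide on the same output entry.
er, cr = workload_metrics(a_rows=[[0, 1], [1]], b_rows=[[0, 2], [0, 1]])
print(er, cr)   # ER = 6/3 = 2.0, CR = 6/5 = 1.2
```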
Original abstract

In computational science and data analytics, many workloads involve irregular and sparse computations that are inherently difficult to optimize for modern hardware. A key kernel is Sparse General Matrix-Matrix Multiplication (SpGEMM), which underpins simulations, graph analytics, and machine learning applications. SpGEMM exhibits irregular memory access patterns and workload imbalance, making it challenging to achieve high performance on GPUs. Current GPU SpGEMM solutions typically rely on a two-pass workflow to address load imbalance and reduce memory access. The symbolic pass, which determines the number of output elements per row, accounts for roughly 28% of the total runtime on average. In this work, we question the necessity of exact symbolic computation and introduce an estimation-based SpGEMM workflow. Our approach replaces the costly symbolic step with lightweight HyperLogLog estimators, combined with an analysis strategy that dynamically selects the optimal workflow and guides accumulator configuration. In addition, we introduce a hybrid accumulator design, including a novel hash-based accumulator that leverages both shared and global memory. Our approach consistently outperforms leading GPU SpGEMM implementations across a wide range of both square and rectangular matrices, achieving speedups of 1.4x-2.8x on NVIDIA A100 and H100 architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Ocean, an estimation-based SpGEMM algorithm for GPUs that replaces the exact symbolic pass (averaging 28% of runtime) with HyperLogLog cardinality estimators, adds dynamic workflow selection to choose optimal paths and accumulator configurations, and proposes a hybrid hash-based accumulator that uses both shared and global memory. It reports consistent speedups of 1.4x–2.8x versus leading GPU SpGEMM implementations on NVIDIA A100 and H100 GPUs across square and rectangular matrices.

Significance. If the HyperLogLog estimates maintain sufficient accuracy to guarantee correct output nnz counts and avoid hidden fallback overhead, the work would meaningfully advance GPU sparse linear algebra by eliminating a major source of overhead in irregular workloads. This could benefit graph analytics, scientific simulations, and ML kernels that rely on SpGEMM. The empirical timing comparisons to external baselines constitute the primary evidence; no parameter-free derivations or machine-checked proofs are present.

major comments (2)
  1. [Abstract] Abstract: the central claim that the estimation-based workflow 'consistently outperforms' leading implementations with 1.4x–2.8x speedups rests on the unverified assumption that HyperLogLog error remains small enough for the dynamic selector and hybrid accumulator to provision correct capacity. No quantitative error rates, failure-case analysis, or worst-case behavior for high-variance row distributions are supplied, so the correctness-to-performance link cannot be assessed from the given text.
  2. [Abstract] Abstract and evaluation description: the reported speedups are presented without accompanying data on estimation overhead, the fraction of cases where the dynamic selector chooses a sub-optimal path, or full benchmark tables that would allow verification that net gains are not canceled by under-allocation or fallback costs.
minor comments (1)
  1. The abstract refers to 'a wide range of both square and rectangular matrices' without naming the specific test suites or matrix properties (e.g., nnz distribution, aspect ratios), which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below. Where the feedback identifies gaps in the presentation of supporting data or analysis, we have revised the manuscript to incorporate the requested information.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the estimation-based workflow 'consistently outperforms' leading implementations with 1.4x–2.8x speedups rests on the unverified assumption that HyperLogLog error remains small enough for the dynamic selector and hybrid accumulator to provision correct capacity. No quantitative error rates, failure-case analysis, or worst-case behavior for high-variance row distributions are supplied, so the correctness-to-performance link cannot be assessed from the given text.

    Authors: We agree that the abstract as written does not supply quantitative error rates or worst-case analysis, making it difficult for readers to directly assess the assumption. The full manuscript reports end-to-end timings that implicitly reflect any estimation inaccuracies (as incorrect capacity would trigger fallback or incorrect results), but we acknowledge this is insufficient. We will revise the abstract to include a concise statement on observed estimator accuracy and add a new subsection in the evaluation that provides quantitative error rates, failure-case analysis, and behavior for high-variance row distributions. revision: yes

  2. Referee: [Abstract] Abstract and evaluation description: the reported speedups are presented without accompanying data on estimation overhead, the fraction of cases where the dynamic selector chooses a sub-optimal path, or full benchmark tables that would allow verification that net gains are not canceled by under-allocation or fallback costs.

    Authors: The evaluation section already measures and reports component runtimes that include estimation overhead and dynamic selector decisions. However, we accept that the abstract and main text do not present these breakdowns or full per-matrix tables explicitly enough to let readers verify net gains independently. We will expand the evaluation section with additional tables and figures showing estimation overhead percentages, the fraction of sub-optimal selector choices, and complete benchmark results per matrix to confirm that reported speedups are not offset by under-allocation or fallback costs. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical SpGEMM estimation workflow

full rationale

The paper replaces the exact symbolic pass in GPU SpGEMM with HyperLogLog cardinality estimation, adds dynamic workflow selection and a hybrid hash-based accumulator, then reports empirical speedups of 1.4x–2.8x versus external baselines on A100/H100. No mathematical derivation, prediction, or first-principles result is claimed; the central performance claims rest on direct timing measurements against independent implementations rather than any fitted parameter renamed as output, self-citation chain, or self-definitional reduction. The evaluation is therefore grounded in external benchmarks rather than circular reasoning.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The method assumes HyperLogLog cardinality estimation is accurate enough for row non-zero counts in typical sparse matrices and that the dynamic selector can reliably pick the best accumulator strategy from the estimate; both are domain assumptions rather than proven invariants.

free parameters (2)
  • HyperLogLog precision bits
    Controls estimation accuracy versus memory and speed; must be chosen to keep error low enough for the downstream workflow decisions.
  • Dynamic selection thresholds
    Cutoffs that decide when to use estimation versus fallback or which accumulator variant to invoke; these are tuned on the benchmark set.
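The first parameter has a well-known closed-form tradeoff: HyperLogLog's relative standard error is roughly 1.04/√m for m = 2^p registers (Flajolet et al., entry [15] below). A quick table makes the choice concrete; the one-byte-per-register memory figure is a simplifying assumption, since implementations often pack registers more tightly.

```python
import math

for p in (4, 8, 10, 12, 14):
    m = 1 << p
    err = 1.04 / math.sqrt(m)       # HLL relative standard error
    print(f"p={p:2d}  registers={m:5d} (~{m} bytes/row)  rel. std. error ~ {err:.2%}")
```

At p = 10 that is about 3.3% per-row error for roughly 1 KiB of sketch state, which frames why the selector can tolerate estimation on average yet must guard against the error tail.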

pith-pipeline@v0.9.0 · 5513 in / 1267 out tokens · 37799 ms · 2026-05-10T02:34:10.270930+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 19 canonical work pages

  1. [1]

    Rasmus Resen Amossen, Andrea Campagna, and Rasmus Pagh. 2014. Better Size Estimation for Sparse Matrix Products. Algorithmica 69, 3 (July 2014), 741–757. doi:10.1007/s00453-012-9692-9

  2. [2]

    Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. 2016. Balanced hashing and efficient GPU sparse general matrix-matrix multiplication. In Proceedings of the 2016 International Conference on Supercomputing. 1–12

  3. [3]

    Ariful Azad, Georgios A Pavlopoulos, Christos A Ouzounis, Nikos C Kyrpides, and Aydin Buluç. 2018. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research 46, 6 (2018), e33–e33

  4. [4]

    Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM Journal on Scientific Computing 34, 4 (2012), C123–C152. doi:10.1137/110838844

  5. [5]

    Julian Bellavita, Lorenzo Pichetti, Thomas Pasquali, Flavio Vella, and Giulia Guidi. Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects. In Proceedings of the 2026 International Conference on Supercomputing (ICS '26) (Belfast, United Kingdom). ACM, New York, NY, USA, 1–13. doi:10.1145/3797905.3800543

  7. [7]

    Benjamin Brock, Aydın Buluç, and Katherine Yelick. 2024. RDMA-based Algorithms for Sparse Matrix Multiplication on GPUs. In Proceedings of the 38th ACM International Conference on Supercomputing. 225–235

  8. [8]

    Aydın Buluç. 2025. The Ubiquitous Sparse Matrix-Matrix Products. arXiv:2508.04077 [math.NA] https://arxiv.org/abs/2508.04077

  9. [9]

    Edith Cohen. 1996. On optimizing multiplications of sparse matrices. In International Conference on Integer Programming and Combinatorial Optimization. Springer, 219–233

  10. [10]

    Edith Cohen. 1997. Size-Estimation Framework with Applications to Transitive Closure and Reachability. J. Comput. System Sci. 55, 3 (Dec. 1997), 441–453. doi:10.1006/jcss.1997.1534

  11. [11]

    Steven Dalton, Sean Baxter, Duane Merrill, Luke Olson, and Michael Garland. 2015. Optimizing sparse matrix operations on GPUs using merge path. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 407–416

  13. [13]

    Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Linyong Huang, Hongzhong Zheng, and Yuan Xie. 2022. OpSparse: A Highly Optimized Framework for Sparse General Matrix Multiplication on GPUs. IEEE Access 10 (2022), 85960–85974. doi:10.1109/ACCESS.2022.3196940

  14. [14]

    Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Nianxiong Tan, Xiaopeng Yu, Hongzhong Zheng, Jianyi Meng, Xiaolang Yan, and Yuan Xie. 2022. Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio. doi:10.48550/arXiv.2207.13848 arXiv:2207.13848 [cs]

  15. [15]

    Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, DMTCS Proceedings vol. AH (Jan. 2007). doi:10.46298/dmtcs.3545

  16. [16]

    Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A Systematic Survey of General Sparse Matrix-matrix Multiplication. ACM Comput. Surv. 55, 12 (Dec. 2023), 1–36. doi:10.1145/3571157

  17. [17]

    John R Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse matrices in MATLAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications 13, 1 (1992), 333–356

  18. [18]

    Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, and Aydın Buluç. 2021. BELLA: Berkeley efficient long-read to long-read aligner and overlapper. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21). SIAM, 123–134

  20. [20]

    Giulia Guidi, Oguz Selvitopi, Marquita Ellis, Leonid Oliker, Katherine Yelick, and Aydın Buluç. 2021. Parallel string graph construction and transitive reduction for de novo genome assembly. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 517–526

  21. [21]

    Fred G Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software (TOMS) 4, 3 (1978), 250–269

  22. [22]

    Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, Genoa, Italy, 683–692. doi:10.1145/2452376.2452456

  23. [23]

    Abdullah Al Raqibul Islam, Helen Xu, Dong Dai, and Aydın Buluç. 2025. Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation. doi:10.48550/arXiv.2507.21253 arXiv:2507.21253 [cs]

  24. [24]

    Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. doi:10.48550/arXiv.1804.06826 arXiv:1804.06826 [cs]

  25. [25]

    Donald E Knuth. 1998. The Art of Computer Programming: Sorting and Searching, Volume 3. Addison-Wesley Professional

  26. [26]

    Weifeng Liu and Brian Vinter. 2014. An efficient GPU general sparse matrix-matrix multiplication for irregular data. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 370–381

  27. [27]

    Yusuke Nagasaka, Satoshi Matsuoka, Ariful Azad, and Aydın Buluç. 2019. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. Parallel Comput. 90 (2019), 102545

  28. [28]

    Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, Bristol, United Kingdom, 101–110. doi:10.1109/ICPP.2017.19

  29. [29]

    Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu. 2022. TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, Seoul, Republic of Korea, 90–106. doi:10.1145/3503221.3508431

  31. [31]

    NVIDIA Corporation. 2026. cuSPARSE Library. https://docs.nvidia.com/cuda/cusparse/. Version 13.1

  32. [32]

    NVIDIA Corporation. 2026. Volta Tuning Guide Release 13.1. NVIDIA Corporation. https://docs.nvidia.com/cuda/volta-tuning-guide/index.html

  33. [33]

    Mathias Parger, Martin Winter, Daniel Mlakar, and Markus Steinberger. 2020. spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, San Diego, California, 362–375. doi:10.1145/3332466.3374521

  34. [34]

    John G Saw, Mark CK Yang, and Tse Chin Mo. 1984. Chebyshev inequality with estimated mean and variance. The American Statistician 38, 2 (1984), 130–132

  35. [35]

    Oguz Selvitopi, Saliya Ekanayake, Giulia Guidi, Georgios A Pavlopoulos, Ariful Azad, and Aydın Buluç. 2020. Distributed many-to-many protein sequence alignment using sparse matrices. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14

  36. [36]

    Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2020. Reducing communication in graph neural network training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14

  37. [37]

    Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019)

  38. [38]

    Yizhuo Wang, Hongpeng Lin, Bingxin Wei, Jianhua Gao, and Weixing Ji. 2025. Optimizing General Sparse Matrix-Matrix Multiplication on the GPU. ACM Trans. Archit. Code Optim. (Nov. 2025), 3774654. doi:10.1145/3774654

  39. [39]

    Nicholas Wilt. 2013. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education

  40. [40]

    Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus Steinberger. 2019. Adaptive sparse matrix-matrix multiplication on the GPU. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, Washington, District of Columbia, 68–81. doi:10.1145/3293883.3295701


  42. [42]

    Min Wu, Huizhang Luo, Fenfang Li, Yiran Zhang, Zhuo Tang, Kenli Li, Jeff Zhang, and Chubo Liu. 2025. HSMU-SpGEMM: Achieving High Shared Memory Utilization for Parallel Sparse General Matrix-Matrix Multiplication on Modern GPUs. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Las Vegas, NV, USA, 1452–1466. doi:…