Pith · machine review for the scientific record

arxiv: 2604.19004 · v1 · submitted 2026-04-21 · 💻 cs.DC · cs.MS

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU

Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3

classification 💻 cs.DC cs.MS
keywords SpGEMM · GPU · HyperLogLog · sparse matrix multiplication · estimation · hybrid accumulator · symbolic pass · workload imbalance

The pith

Estimation replaces exact symbolic passes to speed up GPU SpGEMM computations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the symbolic pass in two-pass GPU SpGEMM, which determines output row sizes and accounts for roughly 28 percent of runtime on average, can be replaced by lightweight HyperLogLog estimators. This change, paired with dynamic workflow selection and a hybrid accumulator that mixes shared- and global-memory hashing, maintains result correctness while cutting overhead in irregular sparse computations. This matters because SpGEMM underpins key workloads in simulations, graph analytics, and machine learning, where memory irregularity currently limits GPU efficiency. The approach delivers measured speedups of 1.4x to 2.8x over leading implementations on A100 and H100 GPUs for both square and rectangular matrices.

Core claim

Our approach replaces the costly symbolic step with lightweight HyperLogLog estimators, combined with an analysis strategy that dynamically selects the optimal workflow and guides accumulator configuration. In addition, we introduce a hybrid accumulator design, including a novel hash-based accumulator that leverages both shared and global memory. This estimation-based SpGEMM workflow consistently outperforms leading GPU SpGEMM implementations across a wide range of both square and rectangular matrices.

What carries the argument

HyperLogLog cardinality estimators that approximate output row sizes to replace exact symbolic counting, driving dynamic workflow choice and configuration of a hybrid shared-plus-global hash accumulator.
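The mechanism can be sketched in a few lines. The following is a minimal, illustrative re-implementation of the idea — not the authors' GPU kernel — estimating the distinct-column count of one output row; the precision value, the hash function, and the row-of-column-lists matrix layout are all assumptions made for the sketch.

```python
import hashlib
import math

P = 10                              # precision bits (a free parameter; see the ledger below)
M = 1 << P                          # number of registers
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard HLL bias constant for large M

def _hash64(x: int) -> int:
    # 64-bit hash of an integer column index.
    return int.from_bytes(
        hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8).digest(), "little")

def hll_estimate(items) -> float:
    """Estimate the number of distinct values in `items`."""
    regs = [0] * M
    for x in items:
        h = _hash64(x)
        idx = h & (M - 1)               # low P bits select a register
        w = h >> P                      # remaining 64-P bits give the rank
        rank = (64 - P) - w.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    raw = ALPHA * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:        # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw

def estimate_row_nnz(a_row_cols, b_rows):
    # Output row i of C = A*B has one nonzero per distinct column index in the
    # union of the B-rows selected by A's row i -- the quantity the symbolic
    # pass computes exactly and the estimator approximates.
    return hll_estimate(c for k in a_row_cols for c in b_rows[k])
```

With P = 10 the relative standard error is about 1.04/√1024 ≈ 3.3%, which is the kind of slack the workflow selector and accumulator sizing must tolerate.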

Load-bearing premise

HyperLogLog estimation error stays small enough that the selected workflow and accumulator still produce correct results and do not introduce hidden overhead that cancels the reported gains for the tested matrix distributions.

What would settle it

Measure output correctness and total runtime when the method is applied to matrices engineered to produce large HyperLogLog errors, such as highly skewed row-density distributions, and check whether speedups vanish or errors appear.
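Such a stress test could be set up as follows — a sketch with illustrative sizes and a crude Zipf-like skew, not anything taken from the paper — computing the exact per-row symbolic counts that the estimator would have to match:

```python
import random

random.seed(0)
N = 1000

# Crude skew: row k of A gets about N/(k+1) nonzeros, so a handful of rows are
# dense while most hold a single entry -- the high-variance regime the
# falsification test targets.
a_rows = [random.sample(range(N), max(1, N // (k + 1))) for k in range(N)]
b_rows = [random.sample(range(N), 8) for _ in range(N)]

# Exact symbolic pass: nnz of output row i of C = A*B is the size of the union
# of column indices of the B-rows selected by A's row i.
exact = [len({c for k in a_rows[i] for c in b_rows[k]}) for i in range(N)]

skew = max(exact) / (sum(exact) / N)
print(f"max/mean output row size: {skew:.1f}")
```

Running any estimator over `a_rows`/`b_rows` and comparing its per-row predictions against `exact` would directly expose whether the reported speedups survive large estimation errors on skewed inputs.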

Figures

Figures reproduced from arXiv: 2604.19004 by Giulia Guidi, Yifan Li.

Figure 1: Computation of a single output row using Gustavson's algorithm.

Figure 2: Common GPU SpGEMM workflow.

Figure 3: The estimation of per-row nonzeros using HyperLogLog.

Figure 4: Overview of the Ocean SpGEMM workflow, with estimation-based components highlighted in darker backgrounds. The analysis step gathers metrics for workflow selection and load balancing. Then, size prediction uses either symbolic accumulation or estimation to predict per-row output sizes. Finally, numeric computation performs the multiplication using the collected statistics, followed by postprocessing…

Figure 5: Computation of Input Expansion Ratio (ER) and Output Compression Ratio (CR). ER denotes the ratio between the number of intermediate products and the number of nonzeros in input matrix A. CR denotes the ratio between the number of intermediate products and the number of nonzeros in output matrix C.

Figure 6: The smoothed GFLOPS achieved over the square…

Figure 7: Relative speedup of SpGEMM solutions compared to…

Figure 8: CDF of the average relative estimation error of HLL…

Figure 9: Runtime breakdown of individual components.
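Figure 5's two metrics can be made concrete with a short sketch. Under an assumed row-of-column-lists layout (not the paper's GPU data structures), the intermediate-product count, ER, and CR fall out directly:

```python
def workload_metrics(a_rows, b_rows):
    """Input expansion ratio ER and output compression ratio CR (cf. Figure 5)."""
    # Every nonzero A[i, k] spawns nnz(B row k) intermediate products.
    flops = sum(len(b_rows[k]) for cols in a_rows for k in cols)
    nnz_a = sum(len(cols) for cols in a_rows)
    # nnz(C): distinct output columns per row, summed over rows.
    nnz_c = sum(len({c for k in cols for c in b_rows[k]}) for cols in a_rows)
    return flops / nnz_a, flops / nnz_c

# Toy example: ER > 1 when B's rows fan out from A's nonzeros, CR > 1 when
# many intermediate products collide on the same output entry.
er, cr = workload_metrics(a_rows=[[0, 1], [1]], b_rows=[[0, 2], [0, 1]])
print(er, cr)   # ER = 6/3 = 2.0, CR = 6/5 = 1.2
```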
Original abstract

In computational science and data analytics, many workloads involve irregular and sparse computations that are inherently difficult to optimize for modern hardware. A key kernel is Sparse General Matrix-Matrix Multiplication (SpGEMM), which underpins simulations, graph analytics, and machine learning applications. SpGEMM exhibits irregular memory access patterns and workload imbalance, making it challenging to achieve high performance on GPUs. Current GPU SpGEMM solutions typically rely on a two-pass workflow to address load imbalance and reduce memory access. The symbolic pass, which determines the number of output elements per row, accounts for roughly 28% of the total runtime on average. In this work, we question the necessity of exact symbolic computation and introduce an estimation-based SpGEMM workflow. Our approach replaces the costly symbolic step with lightweight HyperLogLog estimators, combined with an analysis strategy that dynamically selects the optimal workflow and guides accumulator configuration. In addition, we introduce a hybrid accumulator design, including a novel hash-based accumulator that leverages both shared and global memory. Our approach consistently outperforms leading GPU SpGEMM implementations across a wide range of both square and rectangular matrices, achieving speedups of 1.4x-2.8x on NVIDIA A100 and H100 architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Ocean, an estimation-based SpGEMM algorithm for GPUs that replaces the exact symbolic pass (averaging 28% of runtime) with HyperLogLog cardinality estimators, adds dynamic workflow selection to choose optimal paths and accumulator configurations, and proposes a hybrid hash-based accumulator that uses both shared and global memory. It reports consistent speedups of 1.4x–2.8x versus leading GPU SpGEMM implementations on NVIDIA A100 and H100 GPUs across square and rectangular matrices.

Significance. If the HyperLogLog estimates maintain sufficient accuracy to guarantee correct output nnz counts and avoid hidden fallback overhead, the work would meaningfully advance GPU sparse linear algebra by eliminating a major source of overhead in irregular workloads. This could benefit graph analytics, scientific simulations, and ML kernels that rely on SpGEMM. The empirical timing comparisons to external baselines constitute the primary evidence; no parameter-free derivations or machine-checked proofs are present.

major comments (2)
  1. [Abstract] Abstract: the central claim that the estimation-based workflow 'consistently outperforms' leading implementations with 1.4x–2.8x speedups rests on the unverified assumption that HyperLogLog error remains small enough for the dynamic selector and hybrid accumulator to provision correct capacity. No quantitative error rates, failure-case analysis, or worst-case behavior for high-variance row distributions are supplied, so the correctness-to-performance link cannot be assessed from the given text.
  2. [Abstract] Abstract and evaluation description: the reported speedups are presented without accompanying data on estimation overhead, the fraction of cases where the dynamic selector chooses a sub-optimal path, or full benchmark tables that would allow verification that net gains are not canceled by under-allocation or fallback costs.
minor comments (1)
  1. The abstract refers to 'a wide range of both square and rectangular matrices' without naming the specific test suites or matrix properties (e.g., nnz distribution, aspect ratios), which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below. Where the feedback identifies gaps in the presentation of supporting data or analysis, we have revised the manuscript to incorporate the requested information.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the estimation-based workflow 'consistently outperforms' leading implementations with 1.4x–2.8x speedups rests on the unverified assumption that HyperLogLog error remains small enough for the dynamic selector and hybrid accumulator to provision correct capacity. No quantitative error rates, failure-case analysis, or worst-case behavior for high-variance row distributions are supplied, so the correctness-to-performance link cannot be assessed from the given text.

    Authors: We agree that the abstract as written does not supply quantitative error rates or worst-case analysis, making it difficult for readers to directly assess the assumption. The full manuscript reports end-to-end timings that implicitly reflect any estimation inaccuracies (as incorrect capacity would trigger fallback or incorrect results), but we acknowledge this is insufficient. We will revise the abstract to include a concise statement on observed estimator accuracy and add a new subsection in the evaluation that provides quantitative error rates, failure-case analysis, and behavior for high-variance row distributions. revision: yes

  2. Referee: [Abstract] Abstract and evaluation description: the reported speedups are presented without accompanying data on estimation overhead, the fraction of cases where the dynamic selector chooses a sub-optimal path, or full benchmark tables that would allow verification that net gains are not canceled by under-allocation or fallback costs.

    Authors: The evaluation section already measures and reports component runtimes that include estimation overhead and dynamic selector decisions. However, we accept that the abstract and main text do not present these breakdowns or full per-matrix tables explicitly enough to let readers verify net gains independently. We will expand the evaluation section with additional tables and figures showing estimation overhead percentages, the fraction of sub-optimal selector choices, and complete benchmark results per matrix to confirm that reported speedups are not offset by under-allocation or fallback costs. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical SpGEMM estimation workflow

full rationale

The paper replaces the exact symbolic pass in GPU SpGEMM with HyperLogLog cardinality estimation, adds dynamic workflow selection and a hybrid hash-based accumulator, then reports empirical speedups of 1.4x–2.8x versus external baselines on A100/H100. No mathematical derivation, prediction, or first-principles result is claimed; the central performance claims rest on direct timing measurements against independent implementations rather than any fitted parameter renamed as output, self-citation chain, or self-definitional reduction. The evaluation is therefore grounded in external benchmarks rather than circular reasoning.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The method assumes HyperLogLog cardinality estimation is accurate enough for row non-zero counts in typical sparse matrices and that the dynamic selector can reliably pick the best accumulator strategy from the estimate; both are domain assumptions rather than proven invariants.

free parameters (2)
  • HyperLogLog precision bits
    Controls estimation accuracy versus memory and speed; must be chosen to keep error low enough for the downstream workflow decisions.
  • Dynamic selection thresholds
    Cutoffs that decide when to use estimation versus fallback or which accumulator variant to invoke; these are tuned on the benchmark set.
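The first parameter has a well-known closed-form tradeoff: HyperLogLog's relative standard error is roughly 1.04/√m for m = 2^p registers (Flajolet et al., entry [15] below). A quick table makes the choice concrete; the one-byte-per-register memory figure is a simplifying assumption, since implementations often pack registers more tightly.

```python
import math

for p in (4, 8, 10, 12, 14):
    m = 1 << p
    err = 1.04 / math.sqrt(m)       # HLL relative standard error
    print(f"p={p:2d}  registers={m:5d} (~{m} bytes/row)  rel. std. error ~ {err:.2%}")
```

At p = 10 that is about 3.3% per-row error for roughly 1 KiB of sketch state, which frames why the selector can tolerate estimation on average yet must guard against the error tail.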

pith-pipeline@v0.9.0 · 5513 in / 1267 out tokens · 37799 ms · 2026-05-10T02:34:10.270930+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 19 canonical work pages

  1. [1]

    Rasmus Resen Amossen, Andrea Campagna, and Rasmus Pagh. 2014. Better Size Estimation for Sparse Matrix Products. Algorithmica 69, 3 (July 2014), 741–757. doi:10.1007/s00453-012-9692-9

  2. [2]

    Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. 2016. Balanced hashing and efficient GPU sparse general matrix-matrix multiplication. In Proceedings of the 2016 International Conference on Supercomputing. 1–12

  3. [3]

    Ariful Azad, Georgios A Pavlopoulos, Christos A Ouzounis, Nikos C Kyrpides, and Aydin Buluç. 2018. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research 46, 6 (2018), e33–e33

  4. [4]

    Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM Journal on Scientific Computing 34, 4 (2012), C123–C152. doi:10.1137/110838844

  5. [5]

    Julian Bellavita, Lorenzo Pichetti, Thomas Pasquali, Flavio Vella, and Giulia Guidi. Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects. In Proceedings of the 2026 International Conference on Supercomputing (ICS '26) (Belfast, United Kingdom). ACM, New York, NY, USA, 1–13. doi:10.1145/3797905.3800543

  7. [7]

    Benjamin Brock, Aydın Buluç, and Katherine Yelick. 2024. RDMA-based Algorithms for Sparse Matrix Multiplication on GPUs. In Proceedings of the 38th ACM International Conference on Supercomputing. 225–235

  8. [8]

    Aydın Buluç. 2025. The Ubiquitous Sparse Matrix-Matrix Products. arXiv:2508.04077 [math.NA] https://arxiv.org/abs/2508.04077

  9. [9]

    Edith Cohen. 1996. On optimizing multiplications of sparse matrices. In International Conference on Integer Programming and Combinatorial Optimization. Springer, 219–233

  10. [10]

    Edith Cohen. 1997. Size-Estimation Framework with Applications to Transitive Closure and Reachability. J. Comput. System Sci. 55, 3 (Dec. 1997), 441–453. doi:10.1006/jcss.1997.1534

  11. [11]

    Steven Dalton, Sean Baxter, Duane Merrill, Luke Olson, and Michael Garland. 2015. Optimizing sparse matrix operations on GPUs using merge path. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 407–416

  13. [13]

    Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Linyong Huang, Hongzhong Zheng, and Yuan Xie. 2022. OpSparse: A Highly Optimized Framework for Sparse General Matrix Multiplication on GPUs. IEEE Access 10 (2022), 85960–85974. doi:10.1109/ACCESS.2022.3196940

  14. [14]

    Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Nianxiong Tan, Xiaopeng Yu, Hongzhong Zheng, Jianyi Meng, Xiaolang Yan, and Yuan Xie. 2022. Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio. doi:10.48550/arXiv.2207.13848 arXiv:2207.13848 [cs]

  15. [15]

    Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, DMTCS Proceedings vol. AH (Jan. 2007). doi:10.46298/dmtcs.3545

  16. [16]

    Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A Systematic Survey of General Sparse Matrix-matrix Multiplication. ACM Comput. Surv. 55, 12 (Dec. 2023), 1–36. doi:10.1145/3571157

  17. [17]

    John R Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse matrices in MATLAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications 13, 1 (1992), 333–356

  18. [18]

    Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, and Aydın Buluç. 2021. BELLA: Berkeley efficient long-read to long-read aligner and overlapper. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21). SIAM, 123–134

  20. [20]

    Giulia Guidi, Oguz Selvitopi, Marquita Ellis, Leonid Oliker, Katherine Yelick, and Aydın Buluç. 2021. Parallel string graph construction and transitive reduction for de novo genome assembly. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 517–526

  21. [21]

    Fred G Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software (TOMS) 4, 3 (1978), 250–269

  22. [22]

    Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, Genoa, Italy, 683–692. doi:10.1145/2452376.2452456

  23. [23]

    Abdullah Al Raqibul Islam, Helen Xu, Dong Dai, and Aydın Buluç. 2025. Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation. doi:10.48550/arXiv.2507.21253 arXiv:2507.21253 [cs]

  24. [24]

    Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. doi:10.48550/arXiv.1804.06826 arXiv:1804.06826 [cs]

  25. [25]

    Donald E Knuth. 1998. The Art of Computer Programming: Sorting and Searching, Volume 3. Addison-Wesley Professional

  26. [26]

    Weifeng Liu and Brian Vinter. 2014. An efficient GPU general sparse matrix-matrix multiplication for irregular data. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 370–381

  27. [27]

    Yusuke Nagasaka, Satoshi Matsuoka, Ariful Azad, and Aydın Buluç. 2019. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. Parallel Comput. 90 (2019), 102545

  28. [28]

    Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, Bristol, United Kingdom, 101–110. doi:10.1109/ICPP.2017.19

  29. [29]

    Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu. 2022. TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, Seoul, Republic of Korea, 90–106. doi:10.1145/3503221.3508431

  31. [31]

    NVIDIA Corporation. 2026. cuSPARSE Library. https://docs.nvidia.com/cuda/cusparse/. Version 13.1

  32. [32]

    NVIDIA Corporation. 2026. Volta Tuning Guide Release 13.1. NVIDIA Corporation. https://docs.nvidia.com/cuda/volta-tuning-guide/index.html

  33. [33]

    Mathias Parger, Martin Winter, Daniel Mlakar, and Markus Steinberger. 2020. spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, San Diego, California, 362–375. doi:10.1145/3332466.3374521

  34. [34]

    John G Saw, Mark CK Yang, and Tse Chin Mo. 1984. Chebyshev inequality with estimated mean and variance. The American Statistician 38, 2 (1984), 130–132

  35. [35]

    Oguz Selvitopi, Saliya Ekanayake, Giulia Guidi, Georgios A Pavlopoulos, Ariful Azad, and Aydın Buluç. 2020. Distributed many-to-many protein sequence alignment using sparse matrices. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14

  36. [36]

    Alok Tripathy, Katherine Yelick, and Aydın Buluç. 2020. Reducing communication in graph neural network training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14

  37. [37]

    Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019)

  38. [38]

    Yizhuo Wang, Hongpeng Lin, Bingxin Wei, Jianhua Gao, and Weixing Ji. 2025. Optimizing General Sparse Matrix-Matrix Multiplication on the GPU. ACM Trans. Archit. Code Optim. (Nov. 2025), 3774654. doi:10.1145/3774654

  39. [39]

    Nicholas Wilt. 2013. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education

  40. [40]

    Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus Steinberger. 2019. Adaptive sparse matrix-matrix multiplication on the GPU. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, Washington, District of Columbia, 68–81. doi:10.1145/3293883.3295701


  42. [42]

    Min Wu, Huizhang Luo, Fenfang Li, Yiran Zhang, Zhuo Tang, Kenli Li, Jeff Zhang, and Chubo Liu. 2025. HSMU-SpGEMM: Achieving High Shared Memory Utilization for Parallel Sparse General Matrix-Matrix Multiplication on Modern GPUs. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Las Vegas, NV, USA, 1452–1466. doi:…