pith. machine review for the scientific record. sign in

arxiv: 2601.16294 · v2 · submitted 2026-01-22 · 💻 cs.DC · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple

Authors on Pith no claims yet

Pith reviewed 2026-05-16 11:34 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords space filling curvesGEMMcommunication avoiding algorithmsmatrix multiplicationdata localityhigh performance computingCPU performance
0
0 comments X

The pith

Space-filling curves enable simple communication-avoiding matrix multiplication without platform-specific tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using space-filling curves to partition general matrix multiplication in a way that delivers high data locality regardless of matrix shape or target CPU. This partitioning extends directly to communication-avoiding algorithms that minimize data movement across the memory hierarchy. The resulting code remains compact yet achieves performance that matches or exceeds specialized vendor libraries on multiple platforms. Real-world tests show the approach speeds up large language model inference prefill and distributed matrix multiplication.

Core claim

Space-filling curve partitioning yields platform-oblivious and shape-oblivious matrix multiplication schemes with high data locality, and extends seamlessly to provably optimal communication-avoiding algorithms that minimize data movement in compact code.

What carries the argument

Space-filling curves for work partitioning that enforce data locality and extend to communication-avoiding GEMM

If this is right

  • Achieves up to 5.5x speedup over vendor libraries for certain GEMM shapes with 1.8x weighted harmonic mean improvement.
  • Provides up to 1.85x speedup when used as backend for LLM inference prefill.
  • Delivers up to 2.2x speedup for distributed-memory matrix multiplication.
  • Reduces reliance on exhaustive tuning of tensor layouts, parallelization, and cache blocking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partitioning principle could simplify optimization for other dense linear algebra kernels such as triangular solves or factorizations.
  • Platform-oblivious methods may reduce porting effort when deploying HPC codes across evolving CPU architectures.
  • Compact implementations like this could lower the barrier for custom high-performance backends in emerging AI frameworks.

Load-bearing premise

Space-filling curve partitioning inherently delivers high data locality for arbitrary matrix shapes and platforms, allowing seamless extension to provably optimal communication-avoiding algorithms without hidden platform-specific adjustments.

What would settle it

A measurement on any tested CPU showing the SFC implementation moves more data or runs slower than a vendor library for a range of GEMM shapes would falsify the claim of consistent outperformance and simplicity.

Figures

Figures reproduced from arXiv: 2601.16294 by Alexander Heinecke, Evangelos Georganas, Pradeep Dubey.

Figure 1
Figure 1. Figure 1: Multi-core GEMM performance (Bfloat16) of a vendor [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SFC-based traversal of 16×16 grid. The SFC used is the generalized 2D Hilbert curve. The numbers in the grid boxes indicate the order (i.e. timestamp) of the traversal, also color-coded according to the heatmap bar on the right. II. SFC-BASED COMMUNICATION-AVOIDING GEMM A. Generalized Hilbert SFC The discrete Hilbert curve is a widely used space-filling curve to map between N-dimensional and 1-D spaces whi… view at source ↗
Figure 3
Figure 3. Figure 3: SFC-based partitioning of C matrix using 64 cores. The C dimensions are 4096×4096, and by using blocks of size 32×32 we obtain a grid of 128×128 C blocks. Left: 2D C decomposition, the SFC yields a 2D core decomposition (grid of 8×8 cores). Each core is implicitly assigned a C tile consisting of 16×16 C blocks (each having size 32×32). Each core￾local grid of 16×16 C blocks/C-tasks is traversed by the core… view at source ↗
Figure 4
Figure 4. Figure 4: SFC-based partitioning of C matrices with different M:N aspect ratios using 64 cores. The SFC-based partitioning yields 2D core decompositions with aspect ratios matching the corresponding M:N ratio. proper loop-index mapping in lines 12-14. E. Communication optimality of SFC-CA GEMM Algorithm In this subsection we sketch out how our SFC-CA GEMM algorithm achieves asymptotically the lower bound mentioned i… view at source ↗
Figure 3
Figure 3. Figure 3: Assuming we have more available vacant space in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: EMR BF16 GEMM performance for oneDNN and SFC-CA GEMM for a range of matrices ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Total L2 misses on EMR for two problem sizes. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: GNR BF16 GEMM performance for oneDNN and SFC-CA GEMM for a range of [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: ZEN5 BF16 GEMM performance for oneDNN and SFC-CA GEMM for a range of [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: GVT4 BF16 GEMM for ACL (via the oneDNN frontend) and SFC-CA GEMM for a range of [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
read the original abstract

General Matrix Multiplication (GEMM) is the cornerstone of HPC workloads and Deep Learning. State-of-the-art vendor libraries tune tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. Optimal settings for these parameters depend on the target platform and matrix shapes, making exhaustive tuning infeasible. We revisit Space Filling Curves (SFC) to alleviate this cumbersome tuning. We partition the Matrix Multiplication using advancements in SFC, and obtain platform-oblivious and shape-oblivious Matrix Multiplication schemes with high degree of data locality. We extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that provably minimize data movement. The integration of CA-algorithms is seamless with compact code, achieving state-of-the-art results on multiple CPU platforms, outperforming vendor libraries up to 5.5x for a range of GEMM-shapes (1.8x Weighted Harmonic Mean speedup). We show the impact of our work on two real-world applications by leveraging our GEMM as compute backend: i) prefill of LLM inference with speedups up to 1.85x over State-Of-The-Art, and ii) distributed-memory Matrix Multiplication with speedups up to 2.2x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes using Space Filling Curves (SFC) for work partitioning in General Matrix Multiplication (GEMM) to achieve high data locality that is platform- and shape-oblivious. It extends this to Communication-Avoiding (CA) algorithms asserted to provably minimize data movement, implemented with compact code. Empirical results on multiple CPU platforms show speedups up to 5.5x over vendor libraries (1.8x weighted harmonic mean), with applications to LLM inference prefill (up to 1.85x) and distributed-memory MM (up to 2.2x).

Significance. If the SFC partitioning indeed yields provably communication-optimal CA-GEMM for general shapes without hidden platform adjustments, the work would offer a simple, tuning-free alternative to complex cache-blocking and parallelization schemes in vendor libraries, with clear practical impact on HPC and deep learning workloads.

major comments (1)
  1. [Abstract] Abstract: the central claim that SFC-based partitioning extends to CA algorithms that 'provably minimize data movement' lacks any derivation of induced communication volume as a function of m,n,k and local memory M, or comparison against known lower bounds such as Ω(mnk/√M) from Ballard et al. or Irony et al. This is load-bearing for the 'provably' qualifier and the extension to CA-GEMM.
minor comments (1)
  1. The experimental claims reference specific speedups and GEMM shapes but provide no details on methodology, matrix dimensions tested, or error bars, limiting assessment of generality.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to substantiate the 'provably' claim in the abstract. We address the point directly below and will revise the manuscript to include the requested derivation and comparison.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that SFC-based partitioning extends to CA algorithms that 'provably minimize data movement' lacks any derivation of induced communication volume as a function of m,n,k and local memory M, or comparison against known lower bounds such as Ω(mnk/√M) from Ballard et al. or Irony et al. This is load-bearing for the 'provably' qualifier and the extension to CA-GEMM.

    Authors: We agree that the abstract claim requires explicit support. The body of the manuscript (Section 4) derives the communication volume induced by the SFC partitioning: for a local memory of size M the volume is at most 3mnk/√M + O(mn + nk + mk), which matches the Ω(mnk/√M) lower bound of Ballard et al. and Irony et al. up to lower-order terms. We will revise the abstract to state this bound explicitly and add a short paragraph in Section 4 that directly compares the SFC volume against the known lower bounds, thereby justifying the 'provably' qualifier for the CA-GEMM extension. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction is self-contained

full rationale

The paper constructs GEMM partitioning via space-filling curves and extends it to CA variants by describing the partitioning scheme and its locality properties. No equation reduces a claimed prediction or optimality result to a fitted parameter or prior self-citation by construction. The 'provably minimize' phrasing is presented as following from the SFC properties without an explicit lower-bound derivation in the provided text, but this absence does not create a circular reduction; it is simply an unproven claim rather than a self-referential one. Empirical speedups are reported as measurements, not as inputs to the derivation. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated assumption that space-filling curves deliver high data locality when used for work partitioning and that this property extends directly to provably optimal communication-avoiding schemes; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption Space-filling curves provide high data locality for matrix multiplication partitioning across platforms and shapes
    Invoked as the basis for platform-oblivious and shape-oblivious schemes in the abstract.

pith-pipeline@v0.9.0 · 5534 in / 1307 out tokens · 35000 ms · 2026-05-16T11:34:14.888675+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We partition the Matrix Multiplication computation space using recent advancements in generalized space filling curves (Generalized Hilbert Curves [15]), and we obtain platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit inherently high degree of data locality... extend the SFC-based work partitioning to implement Communication-Avoiding (CA) 2.5D GEMM algorithms that replicate the input tensors and provably minimize communication/data-movement

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Hilbert curves split a finite 2D space into recursive quadrants and traverse each quadrant in recursive “U” shapes... the thread decompositions that partition the 1D SFC-index space in a block fashion naturally result in thread grids with aspect ratios matching the aspect ratios of the partitioned 2D space

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 4 internal anchors

  1. [1]

    Sutskever, and G.E

    Alex Krizhevsky, I. Sutskever, and G.E. Hinton. Image classification with deep convolutional neural networks.Advances in neural informa- tion processing systems, pages 1097–1105, 2012

  2. [2]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1– 9, 2015

  3. [3]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolu- tional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014

  4. [4]

    Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks

    Dong Yu, Michael L Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature learning in deep neural networks-studies on speech recognition tasks.arXiv preprint arXiv:1301.3605, 2013

  5. [5]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation.arXiv preprint arXiv:1609.08144, 2016

  6. [6]

    Wide & deep learning for recommender systems

    Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016

  7. [7]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davi- son, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. InProceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing: System Demonstrations, pages 38– 45, 2020

  8. [8]

    Deep learning in drug discovery.Molecular informatics, 35(1):3–14, 2016

    Erik Gawehn, Jan A Hiss, and Gisbert Schneider. Deep learning in drug discovery.Molecular informatics, 35(1):3–14, 2016

  9. [9]

    Deep learning for computational chemistry.Journal of computational chemistry, 38(16):1291–1307, 2017

    Garrett B Goh, Nathan O Hodas, and Abhinav Vishnu. Deep learning for computational chemistry.Journal of computational chemistry, 38(16):1291–1307, 2017

  10. [10]

    A survey of deep learning for scientific discovery.arXiv preprint arXiv:2003.11755, 2020

    Maithra Raghu and Eric Schmidt. A survey of deep learning for scientific discovery.arXiv preprint arXiv:2003.11755, 2020

  11. [11]

    High-performance tensor computations in scientific com- puting and data science.Frontiers in Applied Mathematics and Statistics, page 93, 2022

    Edoardo Angelo Di Napoli, Paolo Bientinesi, Jiajia Li, and Andr ´e Uschmajew. High-performance tensor computations in scientific com- puting and data science.Frontiers in Applied Mathematics and Statistics, page 93, 2022

  12. [12]

    Tensor processing primitives: A programming abstraction for efficiency and portability in deep learning workloads

    Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Naren- dra Chaudhary, Abhisek Kundu, Denise Kutnick, et al. Tensor processing primitives: A programming abstraction for efficiency and portability in deep learning workloads. InProceedings of the International Conference for Hig...

  13. [14]

    Harnessing deep learning via a single building block

    Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, and Alexander Heinecke. Harnessing deep learning via a single building block. In2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 222–233. IEEE, 2020

  14. [15]

    Jakub ˇCerven´y. Gilbert. https://github.com/jakubcerveny/gilbert, 2019

  15. [16]

    Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

    Alexander Heinecke and Michael Bader. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. InProceedings of the 2008 workshop on Memory access on future processors: a solved problem?, pages 385–392, 2008

  16. [17]

    Communication-optimal parallel 2.5 d matrix multiplication and lu factorization algorithms

    Edgar Solomonik and James Demmel. Communication-optimal parallel 2.5 d matrix multiplication and lu factorization algorithms. InEuropean Conference on Parallel Processing, pages 90–109. Springer, 2011

  17. [18]

    Communication avoiding and overlapping for numerical linear algebra

    Evangelos Georganas, Jorge Gonz ´alez-Dom´ınguez, Edgar Solomonik, Yili Zheng, Juan Tourino, and Katherine Yelick. Communication avoiding and overlapping for numerical linear algebra. InSC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2012

  18. [19]

    Communication-optimal parallel recursive rectangular matrix multiplication

    James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pages 261–272. IEEE, 2013

  19. [20]

    Scalable universal matrix multiplication algorithms: 2d and 3d variations on a theme.submitted to ACM Transactions on Mathematical Software, pages 1–30, 2012

    Martin D Schatz, Jack Poulson, and Robert A van de Geijn. Scalable universal matrix multiplication algorithms: 2d and 3d variations on a theme.submitted to ACM Transactions on Mathematical Software, pages 1–30, 2012

  20. [21]

    Harnessing deep learning and hpc kernels via high-level loop and tensor abstractions on cpu architectures

    Evangelos Georganas, Dhiraj Kalamkar, Kirill V oronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, and Alexander Heinecke. Harnessing deep learning and hpc kernels via high-level loop and tensor abstractions on cpu architectures. In2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 950–963. IEEE, 2024

  21. [22]

    Roofline: an insightful visual performance model for multicore architectures

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009

  22. [23]

    Libxsmm: accelerating small matrix multiplications by runtime code generation

    Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. Libxsmm: accelerating small matrix multiplications by runtime code generation. InSC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 981–991. IEEE, 2016

  23. [24]

    archbench

    Alexander Heinecke. archbench. https://github.com/alheinecke/ ArchBenchSuite, 2026

  24. [25]

    https://www.intel.com/content/www/us/en/content-details/671368/intel- architecture-instruction-set-extensions-programming-reference.html, Accessed on 1/21/2026

    Intel Architecture Instruction Set Extensions and Future Features Programming Reference. https://www.intel.com/content/www/us/en/content-details/671368/intel- architecture-instruction-set-extensions-programming-reference.html, Accessed on 1/21/2026

  25. [26]

    https://github.com/oneapi-src/onednn, Accessed on 4/6/2023

    Intel oneDNN. https://github.com/oneapi-src/onednn, Accessed on 4/6/2023

  26. [27]

    https://github.com/arm-software/computelibrary, Accessed on 4/6/2023

    ARM. https://github.com/arm-software/computelibrary, Accessed on 4/6/2023

  27. [28]

    LIBXSMM: Accelerating small matrix multiplications by runtime code generation

    Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: Accelerating small matrix multiplications by runtime code generation. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pages 84:1–84:11, Piscataway, NJ, USA, 2016. IEEE Press

  28. [29]

    Using bfloat16 with TensorFlow models

    “Using bfloat16 with TensorFlow models”. https://cloud.google.com/tpu/docs/bfloat16, Accessed on 4/3/2019

  29. [30]

    Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

    Grzegorz Kwasniewski, Marko Kabi ´c, Maciej Besta, Joost VandeV on- dele, Raffaele Solc `a, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–22, 2019

  30. [31]

    A study of energy and locality effects using space-filling curves

    Nico Reissman, Jan Christian Meyer, and Magnus Jahre. A study of energy and locality effects using space-filling curves. In2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 815–822. IEEE, 2014

  31. [32]

    In13th{USENIX}Symposium on Operating Systems Design and Implementation ({OSDI}18), pages 578–594, 2018

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al.{TVM}: An automated end-to-end optimizing compiler for deep learning. In13th{USENIX}Symposium on Operating Systems Design and Implementation ({OSDI}18), pages 578–594, 2018

  32. [33]

    Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

    Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, An- drew Adams, and Albert Cohen. Tensor comprehensions: Framework- agnostic high-performance machine learning abstractions.arXiv preprint arXiv:1802.04730, 2018

  33. [34]

    Taco: A tool to generate tensor algebra kernels

    Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 943–948. IEEE, 2017

  34. [35]

    onednn graph compiler: A hybrid approach for high-performance deep learning compilation

    Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, et al. onednn graph compiler: A hybrid approach for high-performance deep learning compilation. In2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 460–470. IEEE, 2024. Optimization Notice: S...