Recognition: 2 theorem links
· Lean TheoremSpace Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple
Pith reviewed 2026-05-16 11:34 UTC · model grok-4.3
The pith
Space-filling curves enable simple communication-avoiding matrix multiplication without platform-specific tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Space-filling curve partitioning yields platform-oblivious and shape-oblivious matrix multiplication schemes with high data locality, and extends seamlessly to provably optimal communication-avoiding algorithms that minimize data movement in compact code.
What carries the argument
Space-filling curves for work partitioning that enforce data locality and extend to communication-avoiding GEMM
If this is right
- Achieves up to 5.5x speedup over vendor libraries for certain GEMM shapes with 1.8x weighted harmonic mean improvement.
- Provides up to 1.85x speedup when used as backend for LLM inference prefill.
- Delivers up to 2.2x speedup for distributed-memory matrix multiplication.
- Reduces reliance on exhaustive tuning of tensor layouts, parallelization, and cache blocking.
Where Pith is reading between the lines
- The same partitioning principle could simplify optimization for other dense linear algebra kernels such as triangular solves or factorizations.
- Platform-oblivious methods may reduce porting effort when deploying HPC codes across evolving CPU architectures.
- Compact implementations like this could lower the barrier for custom high-performance backends in emerging AI frameworks.
Load-bearing premise
Space-filling curve partitioning inherently delivers high data locality for arbitrary matrix shapes and platforms, allowing seamless extension to provably optimal communication-avoiding algorithms without hidden platform-specific adjustments.
What would settle it
A measurement on any tested CPU showing the SFC implementation moves more data or runs slower than a vendor library for a range of GEMM shapes would falsify the claim of consistent outperformance and simplicity.
Figures
read the original abstract
General Matrix Multiplication (GEMM) is the cornerstone of HPC workloads and Deep Learning. State-of-the-art vendor libraries tune tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. Optimal settings for these parameters depend on the target platform and matrix shapes, making exhaustive tuning infeasible. We revisit Space Filling Curves (SFC) to alleviate this cumbersome tuning. We partition the Matrix Multiplication using advancements in SFC, and obtain platform-oblivious and shape-oblivious Matrix Multiplication schemes with high degree of data locality. We extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that provably minimize data movement. The integration of CA-algorithms is seamless with compact code, achieving state-of-the-art results on multiple CPU platforms, outperforming vendor libraries up to 5.5x for a range of GEMM-shapes (1.8x Weighted Harmonic Mean speedup). We show the impact of our work on two real-world applications by leveraging our GEMM as compute backend: i) prefill of LLM inference with speedups up to 1.85x over State-Of-The-Art, and ii) distributed-memory Matrix Multiplication with speedups up to 2.2x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using Space Filling Curves (SFC) for work partitioning in General Matrix Multiplication (GEMM) to achieve high data locality that is platform- and shape-oblivious. It extends this to Communication-Avoiding (CA) algorithms asserted to provably minimize data movement, implemented with compact code. Empirical results on multiple CPU platforms show speedups up to 5.5x over vendor libraries (1.8x weighted harmonic mean), with applications to LLM inference prefill (up to 1.85x) and distributed-memory MM (up to 2.2x).
Significance. If the SFC partitioning indeed yields provably communication-optimal CA-GEMM for general shapes without hidden platform adjustments, the work would offer a simple, tuning-free alternative to complex cache-blocking and parallelization schemes in vendor libraries, with clear practical impact on HPC and deep learning workloads.
major comments (1)
- [Abstract] Abstract: the central claim that SFC-based partitioning extends to CA algorithms that 'provably minimize data movement' lacks any derivation of induced communication volume as a function of m,n,k and local memory M, or comparison against known lower bounds such as Ω(mnk/√M) from Ballard et al. or Irony et al. This is load-bearing for the 'provably' qualifier and the extension to CA-GEMM.
minor comments (1)
- The experimental claims reference specific speedups and GEMM shapes but provide no details on methodology, matrix dimensions tested, or error bars, limiting assessment of generality.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need to substantiate the 'provably' claim in the abstract. We address the point directly below and will revise the manuscript to include the requested derivation and comparison.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that SFC-based partitioning extends to CA algorithms that 'provably minimize data movement' lacks any derivation of induced communication volume as a function of m,n,k and local memory M, or comparison against known lower bounds such as Ω(mnk/√M) from Ballard et al. or Irony et al. This is load-bearing for the 'provably' qualifier and the extension to CA-GEMM.
Authors: We agree that the abstract claim requires explicit support. The body of the manuscript (Section 4) derives the communication volume induced by the SFC partitioning: for a local memory of size M the volume is at most 3mnk/√M + O(mn + nk + mk), which matches the Ω(mnk/√M) lower bound of Ballard et al. and Irony et al. up to lower-order terms. We will revise the abstract to state this bound explicitly and add a short paragraph in Section 4 that directly compares the SFC volume against the known lower bounds, thereby justifying the 'provably' qualifier for the CA-GEMM extension. revision: yes
Circularity Check
No circularity: algorithmic construction is self-contained
full rationale
The paper constructs GEMM partitioning via space-filling curves and extends it to CA variants by describing the partitioning scheme and its locality properties. No equation reduces a claimed prediction or optimality result to a fitted parameter or prior self-citation by construction. The 'provably minimize' phrasing is presented as following from the SFC properties without an explicit lower-bound derivation in the provided text, but this absence does not create a circular reduction; it is simply an unproven claim rather than a self-referential one. Empirical speedups are reported as measurements, not as inputs to the derivation. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Space-filling curves provide high data locality for matrix multiplication partitioning across platforms and shapes
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We partition the Matrix Multiplication computation space using recent advancements in generalized space filling curves (Generalized Hilbert Curves [15]), and we obtain platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit inherently high degree of data locality... extend the SFC-based work partitioning to implement Communication-Avoiding (CA) 2.5D GEMM algorithms that replicate the input tensors and provably minimize communication/data-movement
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Hilbert curves split a finite 2D space into recursive quadrants and traverse each quadrant in recursive “U” shapes... the thread decompositions that partition the 1D SFC-index space in a block fashion naturally result in thread grids with aspect ratios matching the aspect ratios of the partitioned 2D space
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Alex Krizhevsky, I. Sutskever, and G.E. Hinton. Image classification with deep convolutional neural networks.Advances in neural informa- tion processing systems, pages 1097–1105, 2012
work page 2012
-
[2]
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1– 9, 2015
work page 2015
-
[3]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan and Andrew Zisserman. Very deep convolu- tional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[4]
Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks
Dong Yu, Michael L Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature learning in deep neural networks-studies on speech recognition tasks.arXiv preprint arXiv:1301.3605, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[5]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation.arXiv preprint arXiv:1609.08144, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[6]
Wide & deep learning for recommender systems
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016
work page 2016
-
[7]
Transformers: State-of-the-art natural language processing
Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davi- son, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. InProceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing: System Demonstrations, pages 38– 45, 2020
work page 2020
-
[8]
Deep learning in drug discovery.Molecular informatics, 35(1):3–14, 2016
Erik Gawehn, Jan A Hiss, and Gisbert Schneider. Deep learning in drug discovery.Molecular informatics, 35(1):3–14, 2016
work page 2016
-
[9]
Deep learning for computational chemistry.Journal of computational chemistry, 38(16):1291–1307, 2017
Garrett B Goh, Nathan O Hodas, and Abhinav Vishnu. Deep learning for computational chemistry.Journal of computational chemistry, 38(16):1291–1307, 2017
work page 2017
-
[10]
A survey of deep learning for scientific discovery.arXiv preprint arXiv:2003.11755, 2020
Maithra Raghu and Eric Schmidt. A survey of deep learning for scientific discovery.arXiv preprint arXiv:2003.11755, 2020
-
[11]
Edoardo Angelo Di Napoli, Paolo Bientinesi, Jiajia Li, and Andr ´e Uschmajew. High-performance tensor computations in scientific com- puting and data science.Frontiers in Applied Mathematics and Statistics, page 93, 2022
work page 2022
-
[12]
Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Naren- dra Chaudhary, Abhisek Kundu, Denise Kutnick, et al. Tensor processing primitives: A programming abstraction for efficiency and portability in deep learning workloads. InProceedings of the International Conference for Hig...
work page 2021
-
[14]
Harnessing deep learning via a single building block
Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, and Alexander Heinecke. Harnessing deep learning via a single building block. In2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 222–233. IEEE, 2020
work page 2020
-
[15]
Jakub ˇCerven´y. Gilbert. https://github.com/jakubcerveny/gilbert, 2019
work page 2019
-
[16]
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Alexander Heinecke and Michael Bader. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. InProceedings of the 2008 workshop on Memory access on future processors: a solved problem?, pages 385–392, 2008
work page 2008
-
[17]
Communication-optimal parallel 2.5 d matrix multiplication and lu factorization algorithms
Edgar Solomonik and James Demmel. Communication-optimal parallel 2.5 d matrix multiplication and lu factorization algorithms. InEuropean Conference on Parallel Processing, pages 90–109. Springer, 2011
work page 2011
-
[18]
Communication avoiding and overlapping for numerical linear algebra
Evangelos Georganas, Jorge Gonz ´alez-Dom´ınguez, Edgar Solomonik, Yili Zheng, Juan Tourino, and Katherine Yelick. Communication avoiding and overlapping for numerical linear algebra. InSC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2012
work page 2012
-
[19]
Communication-optimal parallel recursive rectangular matrix multiplication
James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pages 261–272. IEEE, 2013
work page 2013
-
[20]
Martin D Schatz, Jack Poulson, and Robert A van de Geijn. Scalable universal matrix multiplication algorithms: 2d and 3d variations on a theme.submitted to ACM Transactions on Mathematical Software, pages 1–30, 2012
work page 2012
-
[21]
Evangelos Georganas, Dhiraj Kalamkar, Kirill V oronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, and Alexander Heinecke. Harnessing deep learning and hpc kernels via high-level loop and tensor abstractions on cpu architectures. In2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 950–963. IEEE, 2024
work page 2024
-
[22]
Roofline: an insightful visual performance model for multicore architectures
Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009
work page 2009
-
[23]
Libxsmm: accelerating small matrix multiplications by runtime code generation
Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. Libxsmm: accelerating small matrix multiplications by runtime code generation. InSC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 981–991. IEEE, 2016
work page 2016
- [24]
-
[25]
Intel Architecture Instruction Set Extensions and Future Features Programming Reference. https://www.intel.com/content/www/us/en/content-details/671368/intel- architecture-instruction-set-extensions-programming-reference.html, Accessed on 1/21/2026
work page 2026
-
[26]
https://github.com/oneapi-src/onednn, Accessed on 4/6/2023
Intel oneDNN. https://github.com/oneapi-src/onednn, Accessed on 4/6/2023
work page 2023
-
[27]
https://github.com/arm-software/computelibrary, Accessed on 4/6/2023
ARM. https://github.com/arm-software/computelibrary, Accessed on 4/6/2023
work page 2023
-
[28]
LIBXSMM: Accelerating small matrix multiplications by runtime code generation
Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: Accelerating small matrix multiplications by runtime code generation. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pages 84:1–84:11, Piscataway, NJ, USA, 2016. IEEE Press
work page 2016
-
[29]
Using bfloat16 with TensorFlow models
“Using bfloat16 with TensorFlow models”. https://cloud.google.com/tpu/docs/bfloat16, Accessed on 4/3/2019
work page 2019
-
[30]
Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
Grzegorz Kwasniewski, Marko Kabi ´c, Maciej Besta, Joost VandeV on- dele, Raffaele Solc `a, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–22, 2019
work page 2019
-
[31]
A study of energy and locality effects using space-filling curves
Nico Reissman, Jan Christian Meyer, and Magnus Jahre. A study of energy and locality effects using space-filling curves. In2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 815–822. IEEE, 2014
work page 2014
-
[32]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al.{TVM}: An automated end-to-end optimizing compiler for deep learning. In13th{USENIX}Symposium on Operating Systems Design and Implementation ({OSDI}18), pages 578–594, 2018
work page 2018
-
[33]
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, An- drew Adams, and Albert Cohen. Tensor comprehensions: Framework- agnostic high-performance machine learning abstractions.arXiv preprint arXiv:1802.04730, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[34]
Taco: A tool to generate tensor algebra kernels
Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 943–948. IEEE, 2017
work page 2017
-
[35]
onednn graph compiler: A hybrid approach for high-performance deep learning compilation
Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, et al. onednn graph compiler: A hybrid approach for high-performance deep learning compilation. In2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 460–470. IEEE, 2024. Optimization Notice: S...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.