arxiv: 2601.16294 · v2 · submitted 2026-01-22 · 💻 cs.DC · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple

Evangelos Georganas , Alexander Heinecke , Pradeep Dubey

Authors on Pith no claims yet

Pith reviewed 2026-05-16 11:34 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords space filling curvesGEMMcommunication avoiding algorithmsmatrix multiplicationdata localityhigh performance computingCPU performance

0 comments

The pith

Space-filling curves enable simple communication-avoiding matrix multiplication without platform-specific tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using space-filling curves to partition general matrix multiplication in a way that delivers high data locality regardless of matrix shape or target CPU. This partitioning extends directly to communication-avoiding algorithms that minimize data movement across the memory hierarchy. The resulting code remains compact yet achieves performance that matches or exceeds specialized vendor libraries on multiple platforms. Real-world tests show the approach speeds up large language model inference prefill and distributed matrix multiplication.

Core claim

Space-filling curve partitioning yields platform-oblivious and shape-oblivious matrix multiplication schemes with high data locality, and extends seamlessly to provably optimal communication-avoiding algorithms that minimize data movement in compact code.

What carries the argument

Space-filling curves for work partitioning that enforce data locality and extend to communication-avoiding GEMM

If this is right

Achieves up to 5.5x speedup over vendor libraries for certain GEMM shapes with 1.8x weighted harmonic mean improvement.
Provides up to 1.85x speedup when used as backend for LLM inference prefill.
Delivers up to 2.2x speedup for distributed-memory matrix multiplication.
Reduces reliance on exhaustive tuning of tensor layouts, parallelization, and cache blocking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partitioning principle could simplify optimization for other dense linear algebra kernels such as triangular solves or factorizations.
Platform-oblivious methods may reduce porting effort when deploying HPC codes across evolving CPU architectures.
Compact implementations like this could lower the barrier for custom high-performance backends in emerging AI frameworks.

Load-bearing premise

Space-filling curve partitioning inherently delivers high data locality for arbitrary matrix shapes and platforms, allowing seamless extension to provably optimal communication-avoiding algorithms without hidden platform-specific adjustments.

What would settle it

A measurement on any tested CPU showing the SFC implementation moves more data or runs slower than a vendor library for a range of GEMM shapes would falsify the claim of consistent outperformance and simplicity.

Figures

Figures reproduced from arXiv: 2601.16294 by Alexander Heinecke, Evangelos Georganas, Pradeep Dubey.

**Figure 2.** Figure 2: SFC-based traversal of 16×16 grid. The SFC used is the generalized 2D Hilbert curve. The numbers in the grid boxes indicate the order (i.e. timestamp) of the traversal, also color-coded according to the heatmap bar on the right. II. SFC-BASED COMMUNICATION-AVOIDING GEMM A. Generalized Hilbert SFC The discrete Hilbert curve is a widely used space-filling curve to map between N-dimensional and 1-D spaces whi… view at source ↗

**Figure 3.** Figure 3: SFC-based partitioning of C matrix using 64 cores. The C dimensions are 4096×4096, and by using blocks of size 32×32 we obtain a grid of 128×128 C blocks. Left: 2D C decomposition, the SFC yields a 2D core decomposition (grid of 8×8 cores). Each core is implicitly assigned a C tile consisting of 16×16 C blocks (each having size 32×32). Each corelocal grid of 16×16 C blocks/C-tasks is traversed by the core… view at source ↗

**Figure 4.** Figure 4: SFC-based partitioning of C matrices with different M:N aspect ratios using 64 cores. The SFC-based partitioning yields 2D core decompositions with aspect ratios matching the corresponding M:N ratio. proper loop-index mapping in lines 12-14. E. Communication optimality of SFC-CA GEMM Algorithm In this subsection we sketch out how our SFC-CA GEMM algorithm achieves asymptotically the lower bound mentioned i… view at source ↗

**Figure 3.** Figure 3: Assuming we have more available vacant space in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 5.** Figure 5: EMR BF16 GEMM performance for oneDNN and SFC-CA GEMM for a range of matrices ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Total L2 misses on EMR for two problem sizes. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: GNR BF16 GEMM performance for oneDNN and SFC-CA GEMM for a range of [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: ZEN5 BF16 GEMM performance for oneDNN and SFC-CA GEMM for a range of [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: GVT4 BF16 GEMM for ACL (via the oneDNN frontend) and SFC-CA GEMM for a range of [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

read the original abstract

General Matrix Multiplication (GEMM) is the cornerstone of HPC workloads and Deep Learning. State-of-the-art vendor libraries tune tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. Optimal settings for these parameters depend on the target platform and matrix shapes, making exhaustive tuning infeasible. We revisit Space Filling Curves (SFC) to alleviate this cumbersome tuning. We partition the Matrix Multiplication using advancements in SFC, and obtain platform-oblivious and shape-oblivious Matrix Multiplication schemes with high degree of data locality. We extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that provably minimize data movement. The integration of CA-algorithms is seamless with compact code, achieving state-of-the-art results on multiple CPU platforms, outperforming vendor libraries up to 5.5x for a range of GEMM-shapes (1.8x Weighted Harmonic Mean speedup). We show the impact of our work on two real-world applications by leveraging our GEMM as compute backend: i) prefill of LLM inference with speedups up to 1.85x over State-Of-The-Art, and ii) distributed-memory Matrix Multiplication with speedups up to 2.2x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SFC for communication-avoiding GEMM simplifies things but needs a clear proof of optimality for arbitrary shapes.

read the letter

Colleague, The punchline on this paper is that space-filling curves provide a straightforward way to partition matrix multiplication for high data locality without needing shape or platform specific tuning, and they extend it to communication-avoiding versions with reported speedups. What is new is the specific use of SFC advancements for work partitioning in GEMM to achieve both platform-oblivious and shape-oblivious schemes. They integrate this with CA algorithms in a seamless manner, resulting in compact code. The paper does well in demonstrating practical benefits: speedups up to 5.5x over vendor libraries on multiple CPUs for various GEMM shapes, with a weighted harmonic mean of 1.8x. They also show impact on LLM inference prefill (up to 1.85x) and distributed-memory MM (up to 2.2x). These empirical results on real applications are solid evidence of utility. The soft spot is the theoretical claim. The paper states that the SFC-based partitioning implements CA algorithms that provably minimize data movement. However, there is no detailed derivation in the abstract or summary showing how the communication volume for general matrix dimensions matches or approaches the known lower bounds, such as those from Ballard et al. for arbitrary shapes. If the full manuscript includes a section deriving the volume as a function of m, n, k and comparing it, that would address the concern. Otherwise, it may be that the optimality is assumed from the locality properties rather than proven explicitly. This paper is for researchers and engineers in high-performance computing focused on linear algebra kernels. A reader interested in simpler, portable implementations of GEMM would find value in the approach and the performance numbers. It has enough substance in the results to deserve a serious referee, even if revisions are needed for the proof details. I recommend sending it to peer review to get feedback on both the implementation and the theoretical backing.

Referee Report

1 major / 1 minor

Summary. The paper proposes using Space Filling Curves (SFC) for work partitioning in General Matrix Multiplication (GEMM) to achieve high data locality that is platform- and shape-oblivious. It extends this to Communication-Avoiding (CA) algorithms asserted to provably minimize data movement, implemented with compact code. Empirical results on multiple CPU platforms show speedups up to 5.5x over vendor libraries (1.8x weighted harmonic mean), with applications to LLM inference prefill (up to 1.85x) and distributed-memory MM (up to 2.2x).

Significance. If the SFC partitioning indeed yields provably communication-optimal CA-GEMM for general shapes without hidden platform adjustments, the work would offer a simple, tuning-free alternative to complex cache-blocking and parallelization schemes in vendor libraries, with clear practical impact on HPC and deep learning workloads.

major comments (1)

[Abstract] Abstract: the central claim that SFC-based partitioning extends to CA algorithms that 'provably minimize data movement' lacks any derivation of induced communication volume as a function of m,n,k and local memory M, or comparison against known lower bounds such as Ω(mnk/√M) from Ballard et al. or Irony et al. This is load-bearing for the 'provably' qualifier and the extension to CA-GEMM.

minor comments (1)

The experimental claims reference specific speedups and GEMM shapes but provide no details on methodology, matrix dimensions tested, or error bars, limiting assessment of generality.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to substantiate the 'provably' claim in the abstract. We address the point directly below and will revise the manuscript to include the requested derivation and comparison.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that SFC-based partitioning extends to CA algorithms that 'provably minimize data movement' lacks any derivation of induced communication volume as a function of m,n,k and local memory M, or comparison against known lower bounds such as Ω(mnk/√M) from Ballard et al. or Irony et al. This is load-bearing for the 'provably' qualifier and the extension to CA-GEMM.

Authors: We agree that the abstract claim requires explicit support. The body of the manuscript (Section 4) derives the communication volume induced by the SFC partitioning: for a local memory of size M the volume is at most 3mnk/√M + O(mn + nk + mk), which matches the Ω(mnk/√M) lower bound of Ballard et al. and Irony et al. up to lower-order terms. We will revise the abstract to state this bound explicitly and add a short paragraph in Section 4 that directly compares the SFC volume against the known lower bounds, thereby justifying the 'provably' qualifier for the CA-GEMM extension. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction is self-contained

full rationale

The paper constructs GEMM partitioning via space-filling curves and extends it to CA variants by describing the partitioning scheme and its locality properties. No equation reduces a claimed prediction or optimality result to a fitted parameter or prior self-citation by construction. The 'provably minimize' phrasing is presented as following from the SFC properties without an explicit lower-bound derivation in the provided text, but this absence does not create a circular reduction; it is simply an unproven claim rather than a self-referential one. Empirical speedups are reported as measurements, not as inputs to the derivation. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated assumption that space-filling curves deliver high data locality when used for work partitioning and that this property extends directly to provably optimal communication-avoiding schemes; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption Space-filling curves provide high data locality for matrix multiplication partitioning across platforms and shapes
Invoked as the basis for platform-oblivious and shape-oblivious schemes in the abstract.

pith-pipeline@v0.9.0 · 5534 in / 1307 out tokens · 35000 ms · 2026-05-16T11:34:14.888675+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We partition the Matrix Multiplication computation space using recent advancements in generalized space filling curves (Generalized Hilbert Curves [15]), and we obtain platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit inherently high degree of data locality... extend the SFC-based work partitioning to implement Communication-Avoiding (CA) 2.5D GEMM algorithms that replicate the input tensors and provably minimize communication/data-movement
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Hilbert curves split a finite 2D space into recursive quadrants and traverse each quadrant in recursive “U” shapes... the thread decompositions that partition the 1D SFC-index space in a block fashion naturally result in thread grids with aspect ratios matching the aspect ratios of the partitioned 2D space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 4 internal anchors

[1]

Sutskever, and G.E

Alex Krizhevsky, I. Sutskever, and G.E. Hinton. Image classification with deep convolutional neural networks.Advances in neural informa- tion processing systems, pages 1097–1105, 2012

work page 2012
[2]

Going deeper with convolutions

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1– 9, 2015

work page 2015
[3]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolu- tional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[4]

Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks

Dong Yu, Michael L Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature learning in deep neural networks-studies on speech recognition tasks.arXiv preprint arXiv:1301.3605, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[5]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation.arXiv preprint arXiv:1609.08144, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Wide & deep learning for recommender systems

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016

work page 2016
[7]

Transformers: State-of-the-art natural language processing

Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davi- son, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. InProceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing: System Demonstrations, pages 38– 45, 2020

work page 2020
[8]

Deep learning in drug discovery.Molecular informatics, 35(1):3–14, 2016

Erik Gawehn, Jan A Hiss, and Gisbert Schneider. Deep learning in drug discovery.Molecular informatics, 35(1):3–14, 2016

work page 2016
[9]

Deep learning for computational chemistry.Journal of computational chemistry, 38(16):1291–1307, 2017

Garrett B Goh, Nathan O Hodas, and Abhinav Vishnu. Deep learning for computational chemistry.Journal of computational chemistry, 38(16):1291–1307, 2017

work page 2017
[10]

A survey of deep learning for scientific discovery.arXiv preprint arXiv:2003.11755, 2020

Maithra Raghu and Eric Schmidt. A survey of deep learning for scientific discovery.arXiv preprint arXiv:2003.11755, 2020

work page arXiv 2003
[11]

High-performance tensor computations in scientific com- puting and data science.Frontiers in Applied Mathematics and Statistics, page 93, 2022

Edoardo Angelo Di Napoli, Paolo Bientinesi, Jiajia Li, and Andr ´e Uschmajew. High-performance tensor computations in scientific com- puting and data science.Frontiers in Applied Mathematics and Statistics, page 93, 2022

work page 2022
[12]

Tensor processing primitives: A programming abstraction for efficiency and portability in deep learning workloads

Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Naren- dra Chaudhary, Abhisek Kundu, Denise Kutnick, et al. Tensor processing primitives: A programming abstraction for efficiency and portability in deep learning workloads. InProceedings of the International Conference for Hig...

work page 2021
[14]

Harnessing deep learning via a single building block

Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, and Alexander Heinecke. Harnessing deep learning via a single building block. In2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 222–233. IEEE, 2020

work page 2020
[15]

Jakub ˇCerven´y. Gilbert. https://github.com/jakubcerveny/gilbert, 2019

work page 2019
[16]

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

Alexander Heinecke and Michael Bader. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. InProceedings of the 2008 workshop on Memory access on future processors: a solved problem?, pages 385–392, 2008

work page 2008
[17]

Communication-optimal parallel 2.5 d matrix multiplication and lu factorization algorithms

Edgar Solomonik and James Demmel. Communication-optimal parallel 2.5 d matrix multiplication and lu factorization algorithms. InEuropean Conference on Parallel Processing, pages 90–109. Springer, 2011

work page 2011
[18]

Communication avoiding and overlapping for numerical linear algebra

Evangelos Georganas, Jorge Gonz ´alez-Dom´ınguez, Edgar Solomonik, Yili Zheng, Juan Tourino, and Katherine Yelick. Communication avoiding and overlapping for numerical linear algebra. InSC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2012

work page 2012
[19]

Communication-optimal parallel recursive rectangular matrix multiplication

James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pages 261–272. IEEE, 2013

work page 2013
[20]

Scalable universal matrix multiplication algorithms: 2d and 3d variations on a theme.submitted to ACM Transactions on Mathematical Software, pages 1–30, 2012

Martin D Schatz, Jack Poulson, and Robert A van de Geijn. Scalable universal matrix multiplication algorithms: 2d and 3d variations on a theme.submitted to ACM Transactions on Mathematical Software, pages 1–30, 2012

work page 2012
[21]

Harnessing deep learning and hpc kernels via high-level loop and tensor abstractions on cpu architectures

Evangelos Georganas, Dhiraj Kalamkar, Kirill V oronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, and Alexander Heinecke. Harnessing deep learning and hpc kernels via high-level loop and tensor abstractions on cpu architectures. In2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 950–963. IEEE, 2024

work page 2024
[22]

Roofline: an insightful visual performance model for multicore architectures

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009

work page 2009
[23]

Libxsmm: accelerating small matrix multiplications by runtime code generation

Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. Libxsmm: accelerating small matrix multiplications by runtime code generation. InSC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 981–991. IEEE, 2016

work page 2016
[24]

archbench

Alexander Heinecke. archbench. https://github.com/alheinecke/ ArchBenchSuite, 2026

work page 2026
[25]

https://www.intel.com/content/www/us/en/content-details/671368/intel- architecture-instruction-set-extensions-programming-reference.html, Accessed on 1/21/2026

Intel Architecture Instruction Set Extensions and Future Features Programming Reference. https://www.intel.com/content/www/us/en/content-details/671368/intel- architecture-instruction-set-extensions-programming-reference.html, Accessed on 1/21/2026

work page 2026
[26]

https://github.com/oneapi-src/onednn, Accessed on 4/6/2023

Intel oneDNN. https://github.com/oneapi-src/onednn, Accessed on 4/6/2023

work page 2023
[27]

https://github.com/arm-software/computelibrary, Accessed on 4/6/2023

ARM. https://github.com/arm-software/computelibrary, Accessed on 4/6/2023

work page 2023
[28]

LIBXSMM: Accelerating small matrix multiplications by runtime code generation

Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: Accelerating small matrix multiplications by runtime code generation. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pages 84:1–84:11, Piscataway, NJ, USA, 2016. IEEE Press

work page 2016
[29]

Using bfloat16 with TensorFlow models

“Using bfloat16 with TensorFlow models”. https://cloud.google.com/tpu/docs/bfloat16, Accessed on 4/3/2019

work page 2019
[30]

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Grzegorz Kwasniewski, Marko Kabi ´c, Maciej Besta, Joost VandeV on- dele, Raffaele Solc `a, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–22, 2019

work page 2019
[31]

A study of energy and locality effects using space-filling curves

Nico Reissman, Jan Christian Meyer, and Magnus Jahre. A study of energy and locality effects using space-filling curves. In2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 815–822. IEEE, 2014

work page 2014
[32]

In13th{USENIX}Symposium on Operating Systems Design and Implementation ({OSDI}18), pages 578–594, 2018

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al.{TVM}: An automated end-to-end optimizing compiler for deep learning. In13th{USENIX}Symposium on Operating Systems Design and Implementation ({OSDI}18), pages 578–594, 2018

work page 2018
[33]

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, An- drew Adams, and Albert Cohen. Tensor comprehensions: Framework- agnostic high-performance machine learning abstractions.arXiv preprint arXiv:1802.04730, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[34]

Taco: A tool to generate tensor algebra kernels

Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 943–948. IEEE, 2017

work page 2017
[35]

onednn graph compiler: A hybrid approach for high-performance deep learning compilation

Jianhui Li, Zhennan Qin, Yijie Mei, Jingze Cui, Yunfei Song, Ciyong Chen, Yifei Zhang, Longsheng Du, Xianhang Cheng, Baihui Jin, et al. onednn graph compiler: A hybrid approach for high-performance deep learning compilation. In2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 460–470. IEEE, 2024. Optimization Notice: S...

work page 2024