Fast GPU Linear Algebra via Compile-Time Expression Fusion
Pith reviewed 2026-05-08 08:50 UTC · model grok-4.3
The pith
Bandicoot fuses linear algebra expressions into GPU kernels at compile time using templates, removing JIT overhead and often beating PyTorch, TensorFlow, and JAX while matching Armadillo's API.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bandicoot generates fused GPU kernels at compile time through template metaprogramming, yielding efficient kernels that often saturate memory bandwidth with no runtime overhead and no JIT infrastructure; empirical benchmarks show it outperforming commonly used linear algebra toolkits, including PyTorch, TensorFlow, and JAX, while maintaining full API compatibility with Armadillo.
What carries the argument
C++ template metaprogramming that performs expression fusion at compile time to emit single, optimized GPU kernels for chains of linear algebra operations.
If this is right
- Fused kernels frequently reach the memory-bandwidth ceiling of the GPU for the tested operations.
- No separate runtime cost arises from expression tree traversal or individual kernel launches.
- Existing Armadillo CPU code can be recompiled for the GPU with essentially no source changes.
- Performance gains appear across a range of common matrix and vector operations relative to PyTorch, TensorFlow, and JAX.
Where Pith is reading between the lines
- The same compile-time fusion technique could be applied to other numerical domains that currently rely on runtime expression graphs, such as automatic differentiation pipelines.
- Developers might layer higher-level domain-specific languages on top of the Bandicoot interface without reintroducing kernel-launch overhead.
- For very large or dynamically generated expressions the compile-time approach may hit practical limits that dynamic systems avoid.
- The results suggest that static metaprogramming can serve as a viable alternative to JIT infrastructure in performance-critical numerical libraries.
Load-bearing premise
Template metaprogramming can fuse a broad range of linear algebra expressions at compile time while preserving full Armadillo API compatibility and keeping compilation times and supported expression limits practical.
What would settle it
A standard linear algebra expression that either causes Bandicoot to emit kernels running well below memory bandwidth on target hardware, or triggers compilation times orders of magnitude longer than those of comparable JIT-based libraries.
Original abstract
We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Bandicoot, a C++ GPU linear algebra library whose API is compatible with Armadillo. It uses template metaprogramming to perform expression fusion at compile time, generating fused GPU kernels that often saturate memory bandwidth without runtime overhead or JIT compilation. Empirical results are claimed to show outperformance over PyTorch, TensorFlow, and JAX.
Significance. If the performance claims and general applicability hold, the work would demonstrate a viable compile-time alternative to runtime fusion or JIT-based GPU libraries, allowing Armadillo users to transition to GPU execution with minimal code changes while avoiding dynamic overheads. This could be particularly useful for performance-critical scientific codes that already rely on expression-template interfaces.
major comments (2)
- The abstract and empirical results section state outperformance without providing benchmark details, error bars, specific test cases, data exclusion rules, or hardware configurations. This information is load-bearing for verifying the central performance claim.
- No measurements or discussion of compilation times, template instantiation depth limits, or supported expression complexity appear in the implementation or evaluation sections. This directly affects the claim of general applicability and full Armadillo API compatibility without prohibitive costs, as expression templates are known to risk exponential blowup.
minor comments (1)
- The abstract would benefit from a brief quantitative example of the performance margins achieved.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of reproducibility and practicality. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
Point-by-point responses
-
Referee: The abstract and empirical results section state outperformance without providing benchmark details, error bars, specific test cases, data exclusion rules, or hardware configurations. This information is load-bearing for verifying the central performance claim.
Authors: We agree that the current presentation of results lacks sufficient detail for full reproducibility and verification. In the revised manuscript, we will expand the empirical results section to include comprehensive benchmark specifications: exact test cases and matrix dimensions, any data exclusion criteria, error bars derived from multiple independent runs with statistical reporting, and complete hardware configurations including GPU model, CUDA version, driver details, and host system specifications. The abstract will be updated to reference these enhanced details where appropriate. revision: yes
-
Referee: No measurements or discussion of compilation times, template instantiation depth limits, or supported expression complexity appear in the implementation or evaluation sections. This directly affects the claim of general applicability and full Armadillo API compatibility without prohibitive costs, as expression templates are known to risk exponential blowup.
Authors: The referee rightly notes the absence of compile-time analysis, which is relevant to claims of broad applicability. In the revised version, we will add a new subsection to the implementation section that discusses compilation times for the expressions used in our benchmarks, observed limits on template instantiation depth, and the range of supported fused expression complexities. We will also explain the metaprogramming techniques (such as selective fusion rules and type-based dispatch) that prevent exponential blowup in practice. Quantitative compilation time measurements from our development and testing environment will be included to demonstrate that these costs remain practical for typical linear algebra workloads. revision: yes
Circularity Check
No circularity: claims rest on implementation description and empirical benchmarks
full rationale
The paper presents a library implementation using template metaprogramming for compile-time kernel fusion, with performance claims supported by direct empirical comparisons to PyTorch, TensorFlow, and JAX. No equations, derivations, fitted parameters, or first-principles predictions appear in the abstract or described content. The central claims do not reduce to self-definitions, self-citations as load-bearing premises, or renaming of known results; they are grounded in described code generation techniques and runtime measurements that are independently verifiable outside any internal loop. This is the expected non-finding for an engineering systems paper without mathematical modeling steps.
Reference graph
Works this paper leans on
- [1] Abadi, M., Barham, P., et al.: TensorFlow: A System for Large-Scale Machine Learning. In: USENIX Symposium on Operating Systems Design and Implementation. pp. 265–283 (2016)
- [2] Abdelfattah, A., Beams, N., Carson, R., et al.: MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures. International Journal of High Performance Computing Applications 38(5), 468–490 (2024)
- [3] Ansel, J., Yang, E., He, H., et al.: PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In: ACM Int. Conf. Architectural Support for Programming Languages and Operating Systems. vol. 2, pp. 929–947 (2024)
- [4] Bard, D., Bellis, M., Allen, M.T., et al.: Cosmological calculations on the GPU. Astronomy and Computing 1, 17–22 (2013)
- [5] Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to numerical computing. SIAM Review 59(1), 65–98 (2017)
- [6] Curtin, R.R., Edel, M., Prabhu, R.G., et al.: The ensmallen library for flexible numerical optimization. Journal of Machine Learning Research 22(1), 7552–7557 (2021)
- [7] Curtin, R.R., Edel, M., Sanderson, C.: Bandicoot: A templated C++ library for GPU linear algebra. arXiv:2508.11385 (2025)
- [8] Curtin, R.R., Edel, M., Shrit, O., et al.: mlpack 4: a fast, header-only C++ machine learning library. Journal of Open Source Software 8(82), 5026 (2023)
- [9] Eddelbuettel, D., Sanderson, C.: RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics & Data Analysis 71, 1054–1063 (2014)
- [10] Frostig, R., Johnson, M.J., Leary, C.: Compiling machine learning programs via high-level tracing. In: Conference on Systems and Machine Learning (SysML) (2018)
- [11] Harper, R.: Practical Foundations for Programming Languages. Cambridge University Press, 2nd edn. (2016)
- [12] Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. vol. 25, pp. 1097–1105 (2012)
- [13] Malcolm, J., Yalamanchili, P., McClanahan, C., et al.: ArrayFire: a GPU acceleration platform. In: Modeling and Simulation for Defense Systems and Applications VII. vol. 8403 (2012)
- [14] Michalakes, J., Vachharajani, M.: GPU acceleration of numerical weather prediction. In: IEEE International Symposium on Parallel and Distributed Processing. pp. 1–7 (2008)
- [15] Niemeyer, K.E., Sung, C.J.: Recent progress and challenges in exploiting graphics processors in computational fluid dynamics. The Journal of Supercomputing 67(2), 528–564 (2014)
- [16] Psarras, C., et al.: The linear algebra mapping problem: Current state of linear algebra languages and libraries. ACM Transactions on Mathematical Software 48(3) (2022)
- [17] Sanderson, C., Curtin, R.: Armadillo: An efficient framework for numerical linear algebra. In: International Conference on Computer and Automation Engineering. pp. 303–307 (2025)
- [18] Sanderson, C., Curtin, R.R.: Armadillo: C++ template metaprogramming for compile-time optimization of linear algebra. In: PASC Conference (2017)
- [19] Song, Y., et al.: PowerInfer: Fast large language model serving with a consumer-grade GPU. In: ACM SIGOPS Symposium on Operating Systems Principles. pp. 590–606 (2024)
- [20] Vamathevan, J., Clark, D., Czodrowski, P., et al.: Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery 18(6), 463–477 (2019)
- [21] Vandevoorde, D., Josuttis, N., Gregor, D.: C++ Templates: The Complete Guide. Addison-Wesley, 2nd edn. (2017)