pith. machine review for the scientific record.

arxiv: 2604.22242 · v1 · submitted 2026-04-24 · 💻 cs.MS

Recognition: unknown

Fast GPU Linear Algebra via Compile Time Expression Fusion

Conrad Sanderson, Marcus Edel, Ryan R. Curtin

Pith reviewed 2026-05-08 08:50 UTC · model grok-4.3

classification 💻 cs.MS
keywords GPU linear algebra · template metaprogramming · expression fusion · compile-time optimization · Armadillo API compatibility · Bandicoot library · CUDA kernel generation · memory bandwidth saturation

The pith

Bandicoot fuses linear algebra expressions into GPU kernels at compile time using templates, removing JIT overhead and often beating PyTorch, TensorFlow, and JAX while matching Armadillo's API.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bandicoot as a C++ GPU linear algebra library whose API matches the established Armadillo CPU library, allowing existing codebases to move to the GPU with minimal changes. It achieves this by applying template metaprogramming to fuse entire expressions into single optimized kernels during compilation rather than at runtime. The resulting kernels frequently reach memory-bandwidth limits and deliver higher throughput than common GPU frameworks on the tested operations. A reader would care because the approach eliminates both the runtime cost of separate kernel launches and the infrastructure needed for just-in-time compilation while preserving a familiar high-level interface.

Core claim

Bandicoot generates fused GPU kernels directly at compile time through template metaprogramming, yielding efficient kernels that often saturate memory bandwidth without any runtime overhead or JIT infrastructure, and empirical benchmarks show it outperforms commonly used linear algebra toolkits including PyTorch, TensorFlow, and JAX while maintaining full API compatibility with Armadillo.

What carries the argument

C++ template metaprogramming that performs expression fusion at compile time to emit single, optimized GPU kernels for chains of linear algebra operations.

If this is right

  • Fused kernels frequently reach the memory-bandwidth ceiling of the GPU for the tested operations.
  • No separate runtime cost arises from expression tree traversal or individual kernel launches.
  • Existing Armadillo CPU code can be recompiled for the GPU with essentially no source changes.
  • Performance gains appear across a range of common matrix and vector operations relative to PyTorch, TensorFlow, and JAX.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compile-time fusion technique could be applied to other numerical domains that currently rely on runtime expression graphs, such as automatic differentiation pipelines.
  • Developers might layer higher-level domain-specific languages on top of the Bandicoot interface without reintroducing kernel-launch overhead.
  • For very large or dynamically generated expressions the compile-time approach may hit practical limits that dynamic systems avoid.
  • The results suggest that static metaprogramming can serve as a viable alternative to JIT infrastructure in performance-critical numerical libraries.

Load-bearing premise

Template metaprogramming can fuse a broad range of linear algebra expressions at compile time while preserving full Armadillo API compatibility and keeping compilation times and supported expression limits practical.

What would settle it

A standard linear algebra expression that either causes Bandicoot to produce kernels running well below memory bandwidth on target hardware, or triggers compilation times orders of magnitude longer than those of comparable JIT-based libraries.

Figures

Figures reproduced from arXiv: 2604.22242 by Conrad Sanderson, Marcus Edel, Ryan R. Curtin.

Figure 1
Figure 1: Implementations of the simple expression Z ← 2XYᵀ using various toolkits. view at source ↗
Figure 2
Figure 2: An example C++ program using Bandicoot to execute linear algebra operations on a GPU. The corresponding Armadillo-based program, which uses the CPU instead of the GPU, can be obtained by replacing the code on line 1 with #include <armadillo> and on line 2 with using namespace arma; view at source ↗
Figure 3
Figure 3: Internal architecture and design of Bandicoot. High-level user code is converted to efficient, fused GPU kernels at program compilation time, not at runtime. Vulkan backend support is currently experimental; more backends (e.g. HIP/ROCm) are planned. view at source ↗
Figure 4
Figure 4: A simplified version of Bandicoot's copy skeleton kernel for the CUDA backend. The definition of each COOT_() macro is generated at compile time. view at source ↗
Figure 5
Figure 5: Average runtime (in seconds) for 50 trials of each expression, excluding warm-up costs. Bandicoot is able to fuse operations into a single kernel at compile time, resulting in execution speed that is the fastest or on par with the fastest toolkit in the vast majority of cases. view at source ↗
Figure 6
Figure 6: Effective throughput of matrix addition as a function of the number of matrices to be added. Dashed line indicates peak possible memory bandwidth. view at source ↗
read the original abstract

We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Bandicoot, a C++ GPU linear algebra library whose API is compatible with Armadillo. It uses template metaprogramming to perform expression fusion at compile time, generating fused GPU kernels that often saturate memory bandwidth without runtime overhead or JIT compilation. Empirical results are claimed to show outperformance over PyTorch, TensorFlow, and JAX.

Significance. If the performance claims and general applicability hold, the work would demonstrate a viable compile-time alternative to runtime fusion or JIT-based GPU libraries, allowing Armadillo users to transition to GPU execution with minimal code changes while avoiding dynamic overheads. This could be particularly useful for performance-critical scientific codes that already rely on expression-template interfaces.

major comments (2)
  1. The abstract and empirical results section state outperformance without providing benchmark details, error bars, specific test cases, data exclusion rules, or hardware configurations. This information is load-bearing for verifying the central performance claim.
  2. No measurements or discussion of compilation times, template instantiation depth limits, or supported expression complexity appear in the implementation or evaluation sections. This directly affects the claim of general applicability and full Armadillo API compatibility without prohibitive costs, as expression templates are known to risk exponential blowup.
minor comments (1)
  1. The abstract would benefit from a brief quantitative example of the performance margins achieved.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of reproducibility and practicality. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: The abstract and empirical results section state outperformance without providing benchmark details, error bars, specific test cases, data exclusion rules, or hardware configurations. This information is load-bearing for verifying the central performance claim.

    Authors: We agree that the current presentation of results lacks sufficient detail for full reproducibility and verification. In the revised manuscript, we will expand the empirical results section to include comprehensive benchmark specifications: exact test cases and matrix dimensions, any data exclusion criteria, error bars derived from multiple independent runs with statistical reporting, and complete hardware configurations including GPU model, CUDA version, driver details, and host system specifications. The abstract will be updated to reference these enhanced details where appropriate. revision: yes

  2. Referee: No measurements or discussion of compilation times, template instantiation depth limits, or supported expression complexity appear in the implementation or evaluation sections. This directly affects the claim of general applicability and full Armadillo API compatibility without prohibitive costs, as expression templates are known to risk exponential blowup.

    Authors: The referee rightly notes the absence of compile-time analysis, which is relevant to claims of broad applicability. In the revised version, we will add a new subsection to the implementation section that discusses compilation times for the expressions used in our benchmarks, observed limits on template instantiation depth, and the range of supported fused expression complexities. We will also explain the metaprogramming techniques (such as selective fusion rules and type-based dispatch) that prevent exponential blowup in practice. Quantitative compilation time measurements from our development and testing environment will be included to demonstrate that these costs remain practical for typical linear algebra workloads. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on implementation description and empirical benchmarks

full rationale

The paper presents a library implementation using template metaprogramming for compile-time kernel fusion, with performance claims supported by direct empirical comparisons to PyTorch, TensorFlow, and JAX. No equations, derivations, fitted parameters, or first-principles predictions appear in the abstract or described content. The central claims do not reduce to self-definitions, self-citations as load-bearing premises, or renaming of known results; they are grounded in described code generation techniques and runtime measurements that are independently verifiable outside any internal loop. This is the expected non-finding for an engineering systems paper without mathematical modeling steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering paper describing a software library implementation; no free parameters, mathematical axioms, or invented scientific entities are involved.

pith-pipeline@v0.9.0 · 5402 in / 1093 out tokens · 51263 ms · 2026-05-08T08:50:40.152106+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 1 canonical work page

  1. [1] Abadi, M., Barham, P., et al.: TensorFlow: A system for large-scale machine learning. In: USENIX Symposium on Operating Systems Design and Implementation, pp. 265–283 (2016)

  2. [2] Abdelfattah, A., Beams, N., Carson, R., et al.: MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures. International Journal of High Performance Computing Applications 38(5), 468–490 (2024)

  3. [3] Ansel, J., Yang, E., He, H., et al.: PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In: ACM Int. Conf. Architectural Support for Programming Languages and Operating Systems, vol. 2, pp. 929–947 (2024)

  4. [4] Bard, D., Bellis, M., Allen, M.T., et al.: Cosmological calculations on the GPU. Astronomy and Computing 1, 17–22 (2013)

  5. [5] Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to numerical computing. SIAM Review 59(1), 65–98 (2017)

  6. [6] Curtin, R.R., Edel, M., Prabhu, R.G., et al.: The ensmallen library for flexible numerical optimization. Journal of Machine Learning Research 22(1), 7552–7557 (2021)

  7. [7] Curtin, R.R., Edel, M., Sanderson, C.: Bandicoot: A templated C++ library for GPU linear algebra. arXiv:2508.11385 (2025)

  8. [8] Curtin, R.R., Edel, M., Shrit, O., et al.: mlpack 4: a fast, header-only C++ machine learning library. Journal of Open Source Software 8(82), 5026 (2023)

  9. [9] Eddelbuettel, D., Sanderson, C.: RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics & Data Analysis 71, 1054–1063 (2014)

  10. [10] Frostig, R., Johnson, M.J., Leary, C.: Compiling machine learning programs via high-level tracing. In: Conference on Systems and Machine Learning (SysML) (2018)

  11. [11] Harper, R.: Practical Foundations for Programming Languages. Cambridge University Press, 2nd edn. (2016)

  12. [12] Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105 (2012)

  13. [13] Malcolm, J., Yalamanchili, P., McClanahan, C., et al.: ArrayFire: a GPU acceleration platform. In: Modeling and Simulation for Defense Systems and Applications VII, vol. 8403 (2012)

  14. [14] Michalakes, J., Vachharajani, M.: GPU acceleration of numerical weather prediction. In: IEEE International Symposium on Parallel and Distributed Processing, pp. 1–7 (2008)

  15. [15] Niemeyer, K.E., Sung, C.J.: Recent progress and challenges in exploiting graphics processors in computational fluid dynamics. The Journal of Supercomputing 67(2), 528–564 (2014)

  16. [16] Psarras, C., et al.: The linear algebra mapping problem: Current state of linear algebra languages and libraries. ACM Transactions on Mathematical Software 48(3) (2022)

  17. [17] Sanderson, C., Curtin, R.: Armadillo: An efficient framework for numerical linear algebra. In: International Conference on Computer and Automation Engineering, pp. 303–307 (2025)

  18. [18] Sanderson, C., Curtin, R.R.: Armadillo: C++ template metaprogramming for compile-time optimization of linear algebra. In: PASC Conference (2017)

  19. [19] Song, Y., et al.: PowerInfer: Fast large language model serving with a consumer-grade GPU. In: ACM SIGOPS Symposium on Operating Systems Principles, pp. 590–606 (2024)

  20. [20] Vamathevan, J., Clark, D., Czodrowski, P., et al.: Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery 18(6), 463–477 (2019)

  21. [21] Vandevoorde, D., Josuttis, N., Gregor, D.: C++ Templates: The Complete Guide. Addison-Wesley, 2nd edn. (2017)