arxiv: 2604.05066 · v1 · submitted 2026-04-06 · 💻 cs.PL · cs.AI· cs.PF

Recognition: no theorem link

AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels

Chen Ding, Yanghui Wu, Yekai Pan, Yifan Zhu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:07 UTC · model grok-4.3

classification 💻 cs.PL cs.AIcs.PF

keywords AutoLALAreuse distancepolyhedral analysisdata localityaffine loopsdata movement complexitysymbolic formulasHPC kernels

0 comments

The pith

AutoLALA derives closed-form symbolic formulas for reuse distance in arbitrary affine loop nests by mapping access spaces under access maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoLALA as a tool that takes programs in a small DSL for HPC and AI kernels, lowers them to polyhedral sets and maps through affine transformations, and outputs symbolic formulas for reuse distance and data movement complexity. It computes reuse distance directly as the image of the access space under the access map and derives distributions via sequences of Barvinok counting operations. This approach replaces stack simulation or recursive working-set methods for workloads such as matrix multiplication, tensor contraction, stencil computation, and einsum expressions. A reader would care because data movement through the memory hierarchy is the dominant cost in these programs, and closed-form formulas enable analysis without executing the code or enumerating instances.

Core claim

AutoLALA produces closed-form symbolic formulas for reuse distance and data movement complexity for arbitrary affine loop nests. It does so by lowering DSL programs to polyhedral timestamp spaces and access maps, then computing reuse distance as the image of the access space under the access map and extracting reuse-interval and reuse-distance distributions through Barvinok counting operations.

What carries the argument

The polyhedral lowering pipeline that builds timestamp spaces and access maps from the DSL via affine transformations, combined with Barvinok counting to produce symbolic reuse-interval and reuse-distance distributions.

If this is right

The tool covers tensor contractions, einsum operations, stencil computations, and general polyhedral programs through its DSL.
It generates symbolic formulas instead of relying on stack simulation or Denning's recursive working-set formulation.
Implementation in Rust with safe bindings to the Barvinok library supports both command-line use and an interactive web playground with LaTeX output.
The analysis applies to any affine loop nest that can be expressed in the DSL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Compilers could invoke the same lowering and counting pipeline to choose tile sizes or loop orders that minimize the derived data-movement expressions.
The symbolic formulas allow designers to compare locality cost across kernel variants before any code is written or run.
Extending the DSL to additional language constructs would widen the set of AI and HPC programs that receive exact reuse-distance analysis.

Load-bearing premise

Programs written in the DSL can be lowered to polyhedral sets and maps through affine transformations without changing the semantics that determine correct reuse distances.

What would settle it

Take a concrete affine loop nest such as a 2D matrix multiplication, obtain its symbolic reuse-distance formula from AutoLALA, and compare the formula's predicted reuse distances against reuse distances measured from an instrumented execution trace of the identical nest.

read the original abstract

Data movement is the primary bottleneck in modern computing systems. For loop-based programs common in high-performance computing (HPC) and AI workloads, including matrix multiplication, tensor contraction, stencil computation, and einsum operations, the cost of moving data through the memory hierarchy often exceeds the cost of arithmetic. This paper presents AutoLALA, an open-source tool that analyzes data locality in affine loop programs. The tool accepts programs written in a small domain-specific language (DSL), lowers them to polyhedral sets and maps, and produces closed-form symbolic formulas for reuse distance and data movement complexity. AutoLALA implements the fully symbolic locality analysis of Zhu et al. together with the data movement distance (DMD) framework of Smith et al. In particular, it computes reuse distance as the image of the access space under the access map, avoiding both stack simulation and Denning's recursive working-set formulation. We describe the DSL syntax and its formal semantics, the polyhedral lowering pipeline that constructs timestamp spaces and access maps via affine transformations, and the sequence of Barvinok counting operations used to derive symbolic reuse-interval and reuse-distance distributions. The system is implemented in Rust as a modular library spanning three crates, with safe bindings to the Barvinok library. We provide both a command-line interface and an interactive web playground with LaTeX rendering of the output formulas. The tool handles arbitrary affine loop nests, covering workloads such as tensor contractions, einsum expressions, stencil computations, and general polyhedral programs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoLALA is a practical Rust implementation that automates existing polyhedral reuse-distance analysis for affine loops, but its value rests on whether the DSL lowering preserves exact semantics.

read the letter

The paper's core contribution is a working tool that takes a small DSL for loop nests, lowers them to polyhedral sets and maps, and emits closed-form symbolic formulas for reuse distance and data movement using Barvinok counting. It packages the analysis from Zhu et al. and the DMD framework from Smith et al. into something people can actually run via CLI or web interface, with Rust crates and safe bindings to the counting library. That is the new part: not the math, but the automation and the interfaces for HPC and AI kernels like tensor contractions and stencils.

Referee Report

1 major / 2 minor

Summary. The paper presents AutoLALA, an open-source Rust tool that accepts affine loop programs written in a small DSL, lowers them to polyhedral sets and maps via affine transformations (constructing timestamp spaces and access maps), and applies Barvinok counting operations to derive closed-form symbolic formulas for reuse distance (defined as the image of the access space under the access map) and data movement complexity. It implements the symbolic locality analysis of Zhu et al. together with the DMD framework of Smith et al., and provides both a CLI and an interactive web playground with LaTeX output. The tool is claimed to handle arbitrary affine loop nests including tensor contractions, einsum, stencils, and general polyhedral programs.

Significance. If the lowering step preserves semantics and the Barvinok-derived formulas are correct, AutoLALA would provide a useful automated, simulation-free method for obtaining symbolic reuse-distance and data-movement expressions for HPC and AI kernels. The open-source implementation with safe Barvinok bindings and the interactive interface are practical strengths that could support performance modeling and optimization work.

major comments (1)

[polyhedral lowering pipeline (as described in the abstract and § on DSL semantics and lowering)] The central claim requires that every DSL program lowers to polyhedral sets/maps whose iteration spaces, timestamp ordering, and access functions are semantically equivalent to the original loop nest. The manuscript describes the polyhedral lowering pipeline and claims formal DSL semantics, yet provides no explicit equivalence proof, semantic-preservation argument, or exhaustive case analysis for the lowering step itself (including handling of imperfect nesting, multi-dimensional indexing, and loop bounds). This is load-bearing because any mismatch would invalidate the subsequent image computation and Barvinok counting for reuse distances and DMD formulas.

minor comments (2)

The manuscript would be strengthened by including at least one fully worked example (DSL source, lowered polyhedral sets/maps, intermediate Barvinok counts, and final symbolic formula) to allow readers to verify the end-to-end pipeline.
Notation for timestamp spaces, access maps, and the image operation should be introduced with explicit definitions or references to the cited Zhu et al. and Smith et al. papers to improve readability for readers unfamiliar with the underlying frameworks.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for highlighting the importance of semantic equivalence in the polyhedral lowering pipeline. We address this point directly below and will strengthen the manuscript accordingly.

read point-by-point responses

Referee: The central claim requires that every DSL program lowers to polyhedral sets/maps whose iteration spaces, timestamp ordering, and access functions are semantically equivalent to the original loop nest. The manuscript describes the polyhedral lowering pipeline and claims formal DSL semantics, yet provides no explicit equivalence proof, semantic-preservation argument, or exhaustive case analysis for the lowering step itself (including handling of imperfect nesting, multi-dimensional indexing, and loop bounds). This is load-bearing because any mismatch would invalidate the subsequent image computation and Barvinok counting for reuse distances and DMD formulas.

Authors: We agree that an explicit semantic-preservation argument is necessary to support the central claim. The manuscript describes the DSL syntax, its formal semantics, and the lowering pipeline that constructs timestamp spaces and access maps via affine transformations, but it does not contain a detailed proof or exhaustive case analysis covering imperfect nesting, multi-dimensional indexing, and loop bounds. In the revised version we will add a dedicated subsection to the lowering section that provides a semantic-preservation argument. This subsection will explain how the affine transformations maintain iteration ordering and access functions equivalent to the original DSL semantics, with explicit arguments and examples for the constructs mentioned by the referee (imperfect nests, multi-dimensional indexing, and general loop bounds). We will also reference the formal DSL semantics already present in the paper to ground the argument. This addition will directly address the load-bearing concern for the subsequent Barvinok-based computations. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the symbolic locality analysis pipeline

full rationale

The paper presents AutoLALA as a tool that accepts a DSL, lowers programs to polyhedral sets/maps via affine transformations with claimed formal semantics, and computes reuse distance explicitly as the image of the access space under the access map using Barvinok counting operations. This chain relies on external libraries (Barvinok) and cited frameworks (Zhu et al. for symbolic locality, Smith et al. for DMD), but performs concrete transformations and counting rather than reducing any result to its inputs by definition or self-referential fitting. No predictions are made from fitted parameters, no ansatz is smuggled, and self-citations do not bear the load of the core computation without independent verification through the described pipeline and library. The derivation is self-contained against the stated polyhedral model.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that affine loop programs can be faithfully represented in the DSL and lowered to polyhedral models; no free parameters or invented entities are mentioned.

axioms (2)

domain assumption Input programs are affine loop nests that can be expressed in the provided DSL
The tool accepts programs written in a small domain-specific language for affine loops.
domain assumption Polyhedral sets and maps obtained via affine transformations correctly model access patterns for locality analysis
The lowering pipeline constructs timestamp spaces and access maps via affine transformations.

pith-pipeline@v0.9.0 · 5583 in / 1340 out tokens · 81952 ms · 2026-05-10T19:07:07.457590+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 8 canonical work pages

[1]

Aho, Monica S

Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006.Compilers: Principles, Techniques, and Tools(2nd ed.). Addison-Wesley

2006
[2]

Sadayappan

Wenlei Bao, Sriram Krishnamoorthy, Louis-Noël Pouchet, and P. Sadayappan. 2018. Analytical Modeling of Cache Behavior for Affine Programs.Proceedings of the ACM on Programming Languages2, POPL (2018), 32:1–32:26

2018
[3]

D’Hollander

Kristof Beyls and Erik H. D’Hollander. 2005. Generating Cache Hints for Improved Program Efficiency.Journal of Systems Architecture51, 4 (2005), 223–250

2005
[4]

Mark Blacher, Christoph Staudt, Julien Klaus, Maurice Wenig, Niklas Merk, Alexander Breuer, Max Engel, Sören Laue, and Joachim Giesen. 2024. Einsum Benchmark: Enabling the Development of Next-Generation Tensor Execution Engines. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

2024
[5]

Cociorva, G

D. Cociorva, G. Baumgartner, C. Lam, P. Sadayappan, J. Ramanujam, M. Nooijen, D. E. Bernholdt, and R. Harrison
[6]

InProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation

Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations. InProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Berlin, Germany
[7]

Peter J. Denning. 2021. Working Set Analytics.Comput. Surveys53, 6 (2021), 113:1–113:36. doi:10.1145/3399709

work page doi:10.1145/3399709 2021
[8]

Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1999. Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior.ACM Transactions on Programming Languages and Systems21, 4 (1999)

1999
[9]

Puzak, and Philip G

Allan Hartstein, Vijayalakshmi Srinivasan, Thomas R. Puzak, and Philip G. Emma. 2008. On the Nature of Cache Miss Behavior: Is It √ 2?Journal of Instruction-Level Parallelism10 (2008)

2008
[10]

Jia-Wei Hong and H. T. Kung. 1981. I/O Complexity: The Red-Blue Pebble Game. InProceedings of the ACM Symposium on Theory of Computing. Milwaukee, WI

1981
[11]

Amarasinghe

Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman P. Amarasinghe. 2017. The Tensor Algebra Compiler.Proceedings of the ACM on Programming Languages1, OOPSLA (2017), 77:1–77:29

2017
[12]

Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. InProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 2–14. doi:10.11...

work page doi:10.1109/cgo51591.2021.9370308 2021
[13]

Sadayappan

Rui Li, Aravind Sukumaran-Rajam, Richard Veras, Tze Meng Low, Fabrice Rastello, Atanas Rountev, and P. Sadayappan
[14]

InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Analytical Cache Modeling and Tilesize Optimization for Tensor Contractions. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 74:1–74:13
[15]

Fangzhou Liu, Yifan Zhu, Shaotong Sun, Chen Ding, Wesley Smith, and Kaave Hosseini. 2024. Parallel Loop Locality Analysis for Symbolic Thread Counts. InProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT). 219–232. doi:10.1145/3656019.3676948

work page doi:10.1145/3656019.3676948 2024
[16]

Mattson, Jan Gecsei, Donald R

Richard L. Mattson, Jan Gecsei, Donald R. Slutz, and Irving L. Traiger. 1970. Evaluation Techniques for Storage Hierarchies.IBM Systems Journal9, 2 (1970), 78–117

1970
[17]

Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko

William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko. 2021. Polygeist: Raising C to Polyhedral MLIR. InProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT)

2021
[18]

In: PLDI

Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2020. Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs. InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 808–822. doi:10.1145/3385412.3385989

work page doi:10.1145/3385412.3385989 2020
[19]

Arjun Pitchanathan, Kunwar Grover, and Tobias Grosser. 2024. Falcon: A Scalable Analytical Cache Model.Proceedings of the ACM on Programming Languages8, PLDI (2024), 222:1–222:25. doi:10.1145/3656452

work page doi:10.1145/3656452 2024
[20]

Louis-Noël Pouchet. 2012. PolyBench/C 4.0. http://polybench.sourceforge.net

2012
[21]

Amarasinghe

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P. Amarasinghe
[22]

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,

Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. InProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 519–530. doi:10.1145/2491956.2462176

work page doi:10.1145/2491956.2462176
[23]

Wesley Smith, Aidan Goldfarb, and Chen Ding. 2022. Beyond Time Complexity: Data Movement Complexity Analysis for Matrix Multiplication. arXiv:2203.02536 [cs.DS] https://arxiv.org/abs/2203.02536

work page arXiv 2022
[24]

Sven Verdoolaege. 2010. isl: An Integer Set Library for the Polyhedral Model. InMathematical Software – ICMS 2010 (Lecture Notes in Computer Science, Vol. 6327), Komei Fukuda, Joris Hoeven, Michael Joswig, and Nobuki Takayama AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels 13 (Eds.). Springer, 299–302

2010
[25]

Sven Verdoolaege et al. 2024. Barvinok: A Library for Counting the Number of Integer Points in Parametric and Non-Parametric Polytopes. https://barvinok.sourceforge.io/ Version 0.41.8

2024
[26]

Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2007. Counting Integer Points in Parametric Polytopes Using Barvinok’s Rational Functions.Algorithmica48, 1 (2007), 37–66. doi:10.1007/ s00453-006-1231-0

2007
[27]

Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. 2013. HOTL: A Higher Order Theory of Locality. (2013), 343–356

2013
[28]

Denning, and Yunquan Zhang

Liang Yuan, Chen Ding, Wesley Smith, Peter J. Denning, and Yunquan Zhang. 2019. A Relational Theory of Locality. ACM Transactions on Architecture and Code Optimization16, 3 (2019), 33:1–33:26

2019
[29]

Yutao Zhong, Xipeng Shen, and Chen Ding. 2009. Program Locality Analysis Using Reuse Distance.ACM Transactions on Programming Languages and Systems31, 6 (2009), 20:1–20:39

2009
[30]

Yifan Zhu, Yekai Pan, Chen Ding, and Yanghui Wu. 2026. Fully Symbolic Analysis of Loop Locality: Using Imaginary Reuse to Infer Real Performance. arXiv:2603.10196 [cs.PL] https://arxiv.org/abs/2603.10196

work page arXiv 2026