arxiv: 2606.20917 · v1 · pith:W5SFBP4Snew · submitted 2026-06-18 · 💻 cs.MS · cs.NA · math.NA · physics.comp-ph

NektarIR: A Domain-Specific Compiler for High-Order Finite Element Operations on Heterogeneous Hardware

Pith reviewed 2026-06-26 15:05 UTC · model grok-4.3

classification 💻 cs.MS · cs.NA · math.NA · physics.comp-ph
keywords domain-specific compiler · MLIR · finite element methods · heterogeneous hardware · spectral element methods · computational fluid dynamics · JIT compilation · high-order methods

The pith

A domain-specific MLIR compiler lowers high-level finite element abstractions to optimized kernels for CPUs and GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops NektarIR, a compiler using the MLIR framework, to handle high-order finite element operations on different hardware. It starts with a high-level domain abstraction that is gradually lowered to low-level code, allowing optimizations based on domain knowledge at each stage. This approach targets the common operators in spectral/hp element methods for solving partial differential equations in fluid dynamics. The goal is to reduce the effort needed for hardware-specific optimizations while maintaining performance across architectures.

Core claim

Through the NektarIR MLIR dialect and its lowering pipeline, common finite element operators are represented at a domain-specific level and compiled just-in-time to efficient code for both CPU and GPU targets, as demonstrated by comparisons with the existing Nektar++ framework.

What carries the argument

The NektarIR dialect, a custom MLIR intermediate representation for finite element operators, together with a bespoke lowering pipeline that applies domain-aware optimizations during progressive lowering.
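A toy sketch of progressive lowering may help fix the idea. The operation names `nir.bwd`, `nir.elemental_bwd` and `affine.for` are taken from the figure captions reproduced further down this page; the dictionary-based IR, the pass functions and `run_pipeline` are hypothetical stand-ins written for illustration and are not the MLIR API or the authors' implementation.

```python
# Toy model of a lowering pipeline. The IR is a plain dictionary, which is an
# assumption made for illustration; real MLIR operations are far richer.

def lower_block_to_elemental(op):
    """Rewrite a block-level op as a loop over elements around an elemental op."""
    if op["name"] == "nir.bwd":
        return {
            "name": "for_each_element",
            "body": {"name": "nir.elemental_bwd", "modes": op["modes"]},
        }
    return op


def lower_elemental_to_loops(op):
    """Rewrite an elemental op as a description of an explicit loop nest."""
    if op["name"] == "for_each_element":
        return {"name": "for_each_element",
                "body": lower_elemental_to_loops(op["body"])}
    if op["name"] == "nir.elemental_bwd":
        return {"name": "affine.for",
                "loops": ["quad_point", "mode"],
                "modes": op["modes"]}
    return op


def run_pipeline(op, passes):
    """Apply each pass in order; every pass sees the output of the previous one."""
    for lowering_pass in passes:
        op = lowering_pass(op)
    return op


PIPELINE = [lower_block_to_elemental, lower_elemental_to_loops]
lowered = run_pipeline({"name": "nir.bwd", "modes": 4}, PIPELINE)
print(lowered)
```

Each pass only needs to recognise the operations of its own abstraction level, which is the property that lets domain knowledge (here, the number of modes) travel down the pipeline without being rediscovered by analysis.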

If this is right

  • Common finite element operators can be composed into kernels for discrete differential operators.
  • These kernels can be just-in-time compiled for CPU and GPU architectures.
  • Performance is achieved without manual bespoke optimization for each hardware vendor.
  • Applications in computational fluid dynamics using spectral/hp element methods benefit from this automation.
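The first point above, composing common operators into a kernel, can be sketched in one dimension. This is a minimal illustration, assuming a two-mode basis (1 and x) and a two-point Gauss-Legendre rule; the function names `bwd`, `iproduct` and `mass` are hypothetical and do not come from NektarIR or Nektar++.

```python
import math

# Two-point Gauss-Legendre rule on [-1, 1]; exact for polynomials up to degree 3.
POINTS = [-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)]
WEIGHTS = [1.0, 1.0]

# Basis matrix BASIS[q][m] = phi_m(x_q) for the illustrative basis phi_0 = 1, phi_1 = x.
BASIS = [[1.0, x] for x in POINTS]


def bwd(coeffs):
    """Backward transform: modal coefficients -> values at quadrature points."""
    return [sum(BASIS[q][m] * coeffs[m] for m in range(len(coeffs)))
            for q in range(len(POINTS))]


def iproduct(values):
    """Weighted inner product of point values against every basis function."""
    n_modes = len(BASIS[0])
    return [sum(BASIS[q][m] * WEIGHTS[q] * values[q] for q in range(len(POINTS)))
            for m in range(n_modes)]


def mass(coeffs):
    """Mass operator built from the two blocks above: B^T W B applied to coeffs."""
    return iproduct(bwd(coeffs))


print(mass([1.0, 0.0]))
print(mass([0.0, 1.0]))
```

The mass operator never forms a matrix; it is only the composition of two smaller operators, which is the structure a domain-specific IR can see and optimize.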

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar domain-specific compilers could be developed for other scientific computing domains beyond finite elements.
  • The approach may extend to additional hardware types like accelerators beyond CPU and GPU.
  • Integration with existing frameworks like Nektar++ could be further automated for broader adoption.

Load-bearing premise

The performance of the solvers depends on a small set of common finite element operators that can be fully represented and optimized in the MLIR dialect without losing efficiency or correctness.

What would settle it

Running the generated kernels on target hardware and finding that they underperform hand-tuned implementations on the same architecture by a significant margin would disprove the claim.

Figures

Figures reproduced from arXiv: 2606.20917 by David Moxey, Edward Erasmie-Jones, Giacomo Castiglioni.

Figure 1
Figure 1: IR example of a matrix-matrix product kernel using the Linalg and func dialects.
Excerpt from the surrounding text: … JIT compiled elemental operator kernels via MLIR and LLVM for both CPU and GPU hardware targets. Here, we will present new MLIR passes that utilize domain-specific information to perform optimizing transformations on the IR at various abstraction levels and transform the IR from a single, high-level abstraction to hardware spec…
Figure 2
Figure 2: Examples of the block type.
Excerpt from the surrounding text: … the final dimensions specified by the size will instead correspond to the number of quadrature points in each direction.
  • Finally the Layout attribute encodes the data ordering within the block, which we elaborate on below. The layout of the data is an important parameter to encode as this allows for specialised implementations that can exploit specific ordering of data, partic…
Figure 3
Figure 3: Conversion from block backward transform, nir.bwd, to a loop over elements and an elemental backward transform, nir.elemental_bwd.
Excerpt from the surrounding text:
  • Within the loop, an extract_slice operation returns a block corresponding to a single element in the mesh from the input to the bwd operation being transformed.
  • An elmnt.bwd operation is created, which returns the result of the backward transform applied to a single element.
  • …
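The conversion described for Figure 3 can be mimicked in plain Python. This is a sketch under stated assumptions: the block is stored element by element in one flat list (the paper's Layout attribute may select a different ordering), and `elemental_bwd` and `block_bwd` are hypothetical names, not NektarIR operations.

```python
def elemental_bwd(basis, coeffs):
    """Backward transform for one element: values[q] = sum_m basis[q][m] * coeffs[m]."""
    return [sum(row[m] * coeffs[m] for m in range(len(coeffs))) for row in basis]


def block_bwd(basis, block, n_elements):
    """Block-level backward transform written as a loop over elements."""
    n_points = len(basis)
    n_modes = len(basis[0])
    out = [0.0] * (n_elements * n_points)
    for e in range(n_elements):
        # Analogue of extract_slice: the coefficients that belong to element e.
        element_coeffs = block[e * n_modes:(e + 1) * n_modes]
        element_values = elemental_bwd(basis, element_coeffs)
        # Write the elemental result back into the output block.
        out[e * n_points:(e + 1) * n_points] = element_values
    return out


# Basis phi_0 = 1, phi_1 = x sampled at x = -1, 0, 1 (illustrative values).
basis = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
block = [2.0, 1.0, 0.0, 3.0]  # two elements, two modes each
result = block_bwd(basis, block, n_elements=2)
print(result)
```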
Figure 4
Figure 4: Transformation from NektarIR representation of the backward transform to the affine dialect and explicit loops. The loop structure resembles the expected implementation for the backward transform as described by Equation (17). The figure contains a section of the transformed IR, with arrows indicating that the IR continues.
Excerpt from the surrounding text:
  • elmt_bwd is replaced by a collection of nested affine.for operations which repres…
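To give a feel for what an explicit loop nest for an elemental backward transform looks like, here is a sketch for a quadrilateral element with a tensor-product basis. Equation (17) is not reproduced on this page, so the formula in the docstring is an assumption, and sum factorisation is shown as a standard high-order technique rather than as the loop structure the paper generates. All function names are hypothetical.

```python
def bwd_quad_naive(b0, b1, coeffs):
    """u[i][j] = sum_p sum_q coeffs[p][q] * b0[i][p] * b1[j][q] (assumed form)."""
    n_i, n_p = len(b0), len(b0[0])
    n_j, n_q = len(b1), len(b1[0])
    out = [[0.0] * n_j for _ in range(n_i)]
    for i in range(n_i):
        for j in range(n_j):
            for p in range(n_p):
                for q in range(n_q):
                    out[i][j] += coeffs[p][q] * b0[i][p] * b1[j][q]
    return out


def bwd_quad_sum_factorised(b0, b1, coeffs):
    """Same result in two passes: contract over q first, then over p."""
    n_i, n_p = len(b0), len(b0[0])
    n_j, n_q = len(b1), len(b1[0])
    tmp = [[0.0] * n_j for _ in range(n_p)]
    for p in range(n_p):
        for j in range(n_j):
            for q in range(n_q):
                tmp[p][j] += coeffs[p][q] * b1[j][q]
    out = [[0.0] * n_j for _ in range(n_i)]
    for i in range(n_i):
        for j in range(n_j):
            for p in range(n_p):
                out[i][j] += b0[i][p] * tmp[p][j]
    return out


# Basis phi_0 = 1, phi_1 = x sampled at x = -1, 0, 1 in both directions.
b = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
coeffs = [[1.0, 2.0], [3.0, 4.0]]
naive = bwd_quad_naive(b, b, coeffs)
fast = bwd_quad_sum_factorised(b, b, coeffs)
print(naive == fast)
```

Comparing the two variants on the same input is also a cheap form of the equivalence check that any lowering step should pass.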
Figure 5
Figure 5: The NektarIR lowering pipeline. Schematic overview of the dialects visited by the IR for an elemental operation before compilation to a particular hardware target. The dashed arrows represent the additional step required for SIMD code-generation for CPU targets. (*) The nvvm dialect is a vendor specific dialect for NVIDIA GPUs; other vendor specific dialects exist and are used to lower to their devices.
Figure 6
Figure 6: Loop coalescing of a triangular loop nest using loop coalescing IR transformations. The loops with non-constant upper bounds given by affine-maps are coalesced into a single loop. The upper bound of the resulting loop is obtained from an attribute attached to the innermost loop of the triangular loop nest (which is placed there when the loop is created as part of earlier transformations). Index arrays cont…
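Coalescing a triangular loop nest can be illustrated as follows. The caption is cut off after "Index arrays", so the use of precomputed index arrays to recover the original loop indices is an assumption here, as are all function names; the sketch shows the general technique, not the paper's MLIR pass.

```python
def triangular_nested(n, f):
    """Reference triangular nest: the inner bound depends on the outer index."""
    total = 0.0
    for p in range(n):
        for q in range(n - p):
            total += f(p, q)
    return total


def build_index_arrays(n):
    """Precompute the (p, q) pair visited at each step of the flattened loop."""
    p_index, q_index = [], []
    for p in range(n):
        for q in range(n - p):
            p_index.append(p)
            q_index.append(q)
    return p_index, q_index


def triangular_coalesced(n, f):
    """Single loop with a constant trip count of n * (n + 1) / 2."""
    p_index, q_index = build_index_arrays(n)
    trip_count = n * (n + 1) // 2
    total = 0.0
    for k in range(trip_count):
        total += f(p_index[k], q_index[k])
    return total


def weight(p, q):
    """Arbitrary per-iteration work used to compare the two loop forms."""
    return 10.0 * p + q


print(triangular_nested(4, weight) == triangular_coalesced(4, weight))
```

A single loop with a known trip count is easier to map onto GPU threads or SIMD lanes than a nest whose inner bound changes with the outer index.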
Figure 7
Figure 7: Time to lower the Helmholtz operator on hexahedral and tetrahedral elements from NektarIR to the LLVM dialect for both host and device targets. TOP and TOE refer to the two threading strategies, namely through the expansion modes and over the elements respectively.
Plot axes: polynomial order (p), 1 to 9, against time (s), 0.0 to 0.5. Series: Hex: Host, Hex: TOP, Hex: TOE, Tet: Host, Tet: TOP, Tet: TOE.
Figure 8
Figure 8: Time to compile the Helmholtz operator on hexahedral and tetrahedral elements from NektarIR to the LLVM dialect for both host and device targets. TOP and TOE refer to the two threading strategies, namely through the expansion modes and over the elements respectively.
Excerpt from the surrounding text: … approach and its suitability for adaptive simulations that require fast generation of new kernels as the polynomial order changes.
Figure 9
Figure 9: Throughput comparison of the AVX512 Helmholtz kernel in NektarIR and Nektar++ on AMD EPYC 9554 CPU. (A) and (B) show the throughput of the Helmholtz kernel on hexahedral elements while (C) and (D) correspond to the operator on tetrahedral elements. Each panel shows curves plotted on two logarithmic axes.
Excerpt from the surrounding text: … where the number of degrees of freedom is given by the (total number of input modes)×(the number of ele…
Figure 10
Figure 10: Throughput comparison of the Helmholtz kernel in NektarIR and Nektar++ on a NVIDIA H100 GPU. (A)-(D) correspond to the threading through expansion mode method on (A)-(B) hexahedral and (C)-(D) tetrahedral elements. (E)-(H) correspond to the threading through elements method on (E)-(F) hexahedral and (G)-(H) tetrahedral elements. Each panel shows curves plotted on two logarithmic axes.
Original abstract

Modern high performance computing (HPC) applications must target heterogeneous hardware. This requires significant work to ensure domain specific implementations translate to highly performant kernels across a range of hardware types and vendors, each requiring bespoke optimization to make use of the specific target architecture. Through the development of a domain specific compiler built with the Multi-Level Intermediate Representation (MLIR) project, one can express a high-level, close to the specific domain, abstraction that is progressively lowered to a low, close to metal, abstraction. At each intermediate representation (IR), appropriate optimizations can be applied without costly analysis due to the knowledge embedded in the domain specific IRs. We apply this method to the construction of discrete differential operators for use in spectral/hp element method solvers for computational fluid dynamics (CFD). Here, the performance is driven by a small set of common finite element operators that are composed to create kernels for the discrete differential operators used to solve weak partial differential equations. We create our own MLIR dialect to represent these operators and implement a bespoke lowering pipeline to facilitate the just-in-time compilation of these kernels for both CPU and GPU architectures, and illustrate performance comparisons with the Nektar++ spectral/hp element framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces NektarIR, a custom MLIR dialect for representing high-order finite element operators (mass, stiffness, etc.) in spectral/hp element methods for CFD. It describes a bespoke lowering pipeline that progressively lowers domain-specific IRs to CPU/GPU targets via MLIR, claiming that embedded domain knowledge enables optimizations without costly analysis and that the approach yields performance comparable to the hand-written kernels in the Nektar++ framework.

Significance. If the central preservation claim holds, the work would demonstrate a practical route to portable, high-performance FE kernels on heterogeneous hardware by leveraging MLIR's multi-level IR structure, potentially reducing the engineering effort required for architecture-specific tuning in spectral/hp solvers.

major comments (2)
  1. [Abstract] The assertion that 'performance comparisons with the Nektar++ spectral/hp element framework' are illustrated is unsupported by any quantitative data, error analysis, benchmark tables, or figures; without these, the claim that the NektarIR lowering retains efficiency and correctness relative to reference operators cannot be evaluated.
  2. [Abstract] The description of the NektarIR dialect and 'bespoke lowering pipeline' supplies neither the dialect operation definitions, the lowering rules between IR levels, nor any verification (e.g., mathematical equivalence checks or floating-point reproducibility tests) that the generated kernels for discrete differential operators remain equivalent to those in Nektar++; this directly undermines the weakest assumption that common FE operators can be captured and lowered without loss of efficiency or correctness.
minor comments (1)
  1. [Abstract] The abstract refers to 'a small set of common finite element operators' but does not enumerate them or indicate which subset is implemented in the dialect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify areas where the abstract overstates the manuscript's content without sufficient supporting material. We will revise the manuscript to address these issues directly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'performance comparisons with the Nektar++ spectral/hp element framework' are illustrated is unsupported by any quantitative data, error analysis, benchmark tables, or figures; without these, the claim that the NektarIR lowering retains efficiency and correctness relative to reference operators cannot be evaluated.

    Authors: We agree that the abstract should not assert that performance comparisons are illustrated without quantitative support being evident. The current manuscript text provided does not include benchmark tables, figures, or error analysis to back this claim. We will revise the abstract to remove or qualify the unsupported assertion and add a concise summary of performance results (including key metrics and references to tables/figures) along with basic error analysis in the revised version.
    revision: yes

  2. Referee: [Abstract] The description of the NektarIR dialect and 'bespoke lowering pipeline' supplies neither the dialect operation definitions, the lowering rules between IR levels, nor any verification (e.g., mathematical equivalence checks or floating-point reproducibility tests) that the generated kernels for discrete differential operators remain equivalent to those in Nektar++; this directly undermines the weakest assumption that common FE operators can be captured and lowered without loss of efficiency or correctness.

    Authors: The referee is correct that the abstract provides no operation definitions, lowering rules, or verification evidence. While the full manuscript describes the dialect and pipeline at a high level, it does not supply the requested specifics or equivalence tests. We will revise by adding an appendix or expanded section with sample NektarIR operation definitions, key lowering rules, and verification results (e.g., mathematical equivalence and reproducibility tests) to substantiate the claims.
    revision: yes

Circularity Check

0 steps flagged

No circularity; implementation description relies on external MLIR framework

The paper presents an engineering effort to build a new MLIR dialect (NektarIR) and lowering pipeline for representing and compiling finite-element operators. No mathematical derivation chain, fitted parameters, or predictions are claimed. The central assertions concern the feasibility of capturing common operators (mass, stiffness, etc.) in the dialect and lowering them without loss of correctness or efficiency; these are supported by reference to the external MLIR project rather than any self-referential construction or self-citation load-bearing step. The work is therefore self-contained as an implementation report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the MLIR infrastructure and the assumption that finite element operators form a small composable set suitable for domain-specific IR representation. No free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: MLIR provides a suitable multi-level IR framework for embedding domain knowledge and applying staged optimizations without costly analysis.
    Invoked in the description of the compiler architecture and lowering pipeline.
invented entities (1)
  • NektarIR MLIR dialect (no independent evidence)
    purpose: To represent high-order finite element operators at a domain-specific level.
    New dialect created by the authors for this purpose.

