pith. machine review for the scientific record. sign in

arxiv: 2605.10678 · v1 · submitted 2026-05-11 · 💻 cs.CE · cs.MS

Recognition: no theorem link

A Performance-Portable, Massively Parallel Distributed Nonuniform FFT

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:12 UTC · model grok-4.3

classification 💻 cs.CE cs.MS
keywords nonuniform fast Fourier transformperformance portabilitydistributed computingKokkosGPU accelerationstrong scalingplasma simulationsParticle-in-Fourier
0
0 comments X

The pith

The first distributed performance-portable NUFFT runs on NVIDIA and AMD GPUs and scales to 1024 units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a distributed nonuniform fast Fourier transform that executes across many GPUs without code changes for different hardware vendors. It relies on Kokkos to implement spreading and interpolation steps that achieve single-GPU speeds matching or exceeding a leading specialized CUDA library while also running on AMD accelerators. Large-scale tests on three supercomputers confirm strong scaling to 1024 GPUs, with the distributed FFT component growing in relative cost at high node counts. The work then applies the method to Particle-in-Fourier plasma simulations containing up to 1024 cubed modes and 8.6 billion particles. This matters because prior NUFFT codes were confined to single nodes, limiting the size of spectral calculations on irregular data in imaging, molecular dynamics, and kinetic simulations.

Core claim

We present the first distributed, performance-portable NUFFT for heterogeneous supercomputers. Our Kokkos-based implementation runs without modification on NVIDIA and AMD GPUs. We develop multiple spreading and interpolation kernels optimized for different accuracy requirements and architectures. Our spreading kernels match or exceed the single-GPU throughput of the state-of-the-art CUDA-based NUFFT library cuFINUFFT at production particle densities, while our Kokkos-based implementation additionally supports AMD GPUs. Strong scaling experiments on Alps, JUWELS Booster, and LUMI demonstrate scaling up to 1024 GPUs. At scale, the distributed FFT is a significant part of the total runtime, and

What carries the argument

Kokkos abstractions for spreading, interpolation, and distributed FFT operations that enable the same source code to deliver competitive performance across GPU architectures.

Load-bearing premise

The Kokkos layer produces throughput comparable to hand-tuned vendor code on both NVIDIA and AMD GPUs without hidden architecture-specific changes, and the added communication cost of the distributed FFT does not prevent scaling.

What would settle it

A side-by-side benchmark at fixed particle density showing the Kokkos version on AMD MI250X GPUs delivers less than 60 percent of the throughput of the same code on NVIDIA GH200 or A100 GPUs, or a strong-scaling curve that flattens before 512 GPUs.

Figures

Figures reproduced from arXiv: 2605.10678 by Andreas Adelmann, Paul Fischill, Sriramkrishnan Muralikrishnan.

Figure 2
Figure 2. Figure 2: Kernel throughput versus tolerance on LUMI [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: Kernel throughput as a function of tolerance on [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Kernel throughput versus particle density on [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pruned versus full FFT (forward) on Alps (GH200, [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pruned versus full FFT (forward) on JUWELS [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Component breakdown for the full-FFT NUFFT on [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Strong scaling of the distributed NUFFT for a [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Strong scaling of PIF Landau damping at 10243 modes, 𝜌 = 8, 𝜀 = 10−4 . Solid: pruned FFT; dashed: full FFT. Shading shows min–max range. (4) Advance particle positions and velocities. Steps 1 and 3, the scatter and gather operations, dominate runtime. The main reason PIF schemes have not been pursued in produc￾tion kinetic plasma simulations, despite their attractive numerical properties, is the lack of f… view at source ↗
Figure 12
Figure 12. Figure 12: NUFFT component breakdown on LUMI at 5123 , 𝜀 = 10−8 , using full FFTs. Particle–grid kernels dominate even at high GPU counts. the application’s particle communication. This effect is less pro￾nounced at 10243 where the larger per-GPU workload yields 50.7% parallel efficiency ( [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
read the original abstract

The nonuniform fast Fourier transform (NUFFT) enables spectral methods for problems with irregularly spaced samples, with applications in medical imaging, molecular dynamics, and kinetic plasma simulations. Existing implementations are limited to shared-memory execution, restricting problem sizes to what fits on a single node. We present the first distributed, performance-portable NUFFT for heterogeneous supercomputers. Our Kokkos-based implementation runs without modification on NVIDIA and AMD GPUs. We develop multiple spreading and interpolation kernels optimized for different accuracy requirements and architectures. Our spreading kernels match or exceed the single-GPU throughput of the state-of-the-art CUDA-based NUFFT library cuFINUFFT at production particle densities, while our Kokkos-based implementation additionally supports AMD GPUs. Strong scaling experiments on Alps (NVIDIA GH200), JUWELS Booster (NVIDIA A100), and LUMI (AMD MI250X) demonstrate scaling up to 1024 GPUs. At scale, the distributed FFT is a significant part of the total runtime, making higher NUFFT accuracy less expensive. We apply the method to massively parallel Particle-in-Fourier simulations of Landau damping with up to $1024^3$ Fourier modes and 8.6 billion particles on Alps, JUWELS, and LUMI, demonstrating that distributed NUFFTs enable kinetic plasma simulations at resolutions previously inaccessible to spectral particle methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the first distributed, performance-portable NUFFT implementation using Kokkos for heterogeneous supercomputers supporting both NVIDIA and AMD GPUs. The spreading kernels are said to match or exceed the single-GPU performance of cuFINUFFT at production densities, with strong scaling demonstrated up to 1024 GPUs on systems like Alps, JUWELS, and LUMI, and applied to large-scale Particle-in-Fourier simulations of Landau damping with up to 1024^3 Fourier modes and 8.6 billion particles.

Significance. This work could significantly advance the field by enabling spectral methods for very large problems on modern GPU-based supercomputers, particularly benefiting kinetic plasma simulations that require high resolution. The performance-portable aspect is valuable for users across different hardware platforms, and the scaling results suggest practical utility at exascale.

major comments (2)
  1. [Abstract] The abstract reports performance parity with cuFINUFFT and scaling to 1024 GPUs but provides no quantitative data, error bars, exact benchmark conditions, or methodology details, which is necessary to evaluate the claims.
  2. [Section 4] The kernel descriptions and performance results must explicitly demonstrate that only portable Kokkos primitives are used in the spreading and interpolation kernels, without architecture-specific tuning or vendor intrinsics, to support the claim of competitive performance on AMD GPUs without modification.
minor comments (1)
  1. [Abstract] Clarify the specific accuracy requirements for which the multiple spreading kernels were optimized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to strengthen the presentation of our claims while respecting abstract length constraints.

read point-by-point responses
  1. Referee: [Abstract] The abstract reports performance parity with cuFINUFFT and scaling to 1024 GPUs but provides no quantitative data, error bars, exact benchmark conditions, or methodology details, which is necessary to evaluate the claims.

    Authors: We agree that including select quantitative details would help readers evaluate the claims more readily. Abstracts are subject to strict length limits, so we will add a concise sentence reporting key metrics (e.g., single-GPU spreading throughput in particles per second at representative densities and strong-scaling efficiency to 1024 GPUs) together with the primary benchmark platforms and particle densities used for the cuFINUFFT comparison. Full error bars, methodology, and per-run data will remain in Sections 4 and 5, but the added abstract sentence will provide the requested context without exceeding typical abstract word counts. revision: partial

  2. Referee: [Section 4] The kernel descriptions and performance results must explicitly demonstrate that only portable Kokkos primitives are used in the spreading and interpolation kernels, without architecture-specific tuning or vendor intrinsics, to support the claim of competitive performance on AMD GPUs without modification.

    Authors: We appreciate this clarification request. Our kernels were written exclusively with Kokkos portable abstractions (parallel_for, TeamPolicy, View, atomic_add, and Kokkos::complex) and contain no CUDA, HIP, or vendor-intrinsic code. In the revised Section 4 we will (i) list the exact Kokkos primitives employed in each kernel variant, (ii) include short pseudocode excerpts illustrating the portable implementation, and (iii) add an explicit statement that the identical source compiles and runs on both NVIDIA and AMD GPUs. These additions will directly substantiate the performance-portability claim while preserving the existing performance results. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation and benchmarking paper with no derivations

full rationale

This is a software engineering and performance benchmarking paper describing a Kokkos-based distributed NUFFT implementation. It contains no mathematical derivation chain, no fitted parameters presented as predictions, no self-citation load-bearing uniqueness theorems, and no ansatz smuggling. All central claims (portability, throughput matching cuFINUFFT, strong scaling to 1024 GPUs) are supported by direct empirical measurements on multiple architectures and machines. The paper is self-contained against external benchmarks and therefore exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied computer science implementation paper. No free parameters, mathematical axioms, or invented entities are introduced; performance claims rest on empirical benchmarks of existing Kokkos and FFT primitives.

pith-pipeline@v0.9.0 · 5549 in / 1096 out tokens · 52547 ms · 2026-05-12T05:12:57.089345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    Axel Arnold, Florian Fahrenberger, Christian Holm, Olaf Lenz, Matthias Bolten, Holger Dachsel, Rene Halver, Ivo Kabadshow, Franz Gähler, Frederik Heber, Julian Iseringhausen, Michael Hofmann, Michael Pippig, Daniel Potts, and Godehard Sutmann. 2013. Comparison of scalable fast methods for long-range interactions. Physical Review E88, 6 (Dec. 2013), 063308...

  2. [2]

    Alan Ayala, Stanimire Tomov, Azzam Haidar, and Jack Dongarra. 2020. heFFTe: Highly Efficient FFT for Exascale. InComputational Science – ICCS 2020, Valeria V. Krzhizhanovskaya, Gábor Závodszky, Michael H. Lees, Jack J. Dongarra, Peter M. A. Sloot, Sérgio Brissos, and João Teixeira (Eds.). Vol. 12137. Springer Inter- national Publishing, Cham, 262–275. doi...

  3. [3]

    Alex H. Barnett. 2021. Aliasing error of theexp(𝛽 √ 1−𝑧 2 ) kernel in the nonuni- form fast Fourier transform.Applied and Computational Harmonic Analysis51 (March 2021), 1–16. doi:10.1016/j.acha.2020.10.002

  4. [4]

    Exponential of Semi- circle

    Alexander H. Barnett, Jeremy Magland, and Ludvig af Klinteberg. 2019. A Parallel Nonuniform Fast Fourier Transform Library Based on an “Exponential of Semi- circle” Kernel.SIAM Journal on Scientific Computing41, 5 (Jan. 2019), C479–C504. doi:10.1137/18M120885X

  5. [5]

    C. K. Birdsall and A. B. Langdon. 1991.Plasma Physics via Computer Simulation. IOP Publishing, Bristol, UK. https://ui.adsabs.harvard.edu/abs/1991ppcs.book..... B ADS Bibcode: 1991ppcs.book.....B

  6. [6]

    Carter Edwards, Christian R

    H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns.J. Parallel and Distrib. Comput.74, 12 (Dec. 2014), 3202–3216. doi:10. 1016/j.jpdc.2014.07.003

  7. [7]

    Zhe Chen and Charles S. Peskin. 2024. A Fourier spectral immersed boundary method with exact translation invariance, improved boundary resolution, and a divergence-free velocity field.J. Comput. Phys.509 (July 2024), 113048. doi:10. 1016/j.jcp.2024.113048

  8. [8]

    Yeung, and Richard Vuduc

    Kenneth Czechowski, Casey Battaglino, Chris McClanahan, Kartik Iyer, P.-K. Yeung, and Richard Vuduc. 2012. On the communication complexity of 3D FFTs and its implications for Exascale. InProceedings of the 26th ACM International Conference on Supercomputing. ACM, San Servolo Island, Venice, Italy, 205–214. doi:10.1145/2304576.2304604

  9. [9]

    Dutt and V

    A. Dutt and V. Rokhlin. 1993. Fast Fourier Transforms for Nonequispaced Data. SIAM Journal on Scientific Computing14, 6 (Nov. 1993), 1368–1393. doi:10.1137/ 0914081

  10. [10]

    Dutt and V

    A. Dutt and V. Rokhlin. 1995. Fast Fourier Transforms for Nonequispaced Data, II.Applied and Computational Harmonic Analysis2, 1 (Jan. 1995), 85–100. doi:10. 1006/acha.1995.1007

  11. [11]

    Chaolu Feng and Dazhe Zhao. 2015. CUDA accelerated uniform re-sampling for non-Cartesian MR reconstruction.Bio-Medical Materials and Engineering26 Suppl 1 (2015), S983–989. doi:10.3233/BME-151393

  12. [12]

    J. A. Fessler and B. P. Sutton. 2003. Nonuniform fast Fourier transforms using min-max interpolation.IEEE Transactions on Signal Processing51, 2 (Feb. 2003), 560–574. doi:10.1109/TSP.2002.807005

  13. [13]

    Matthias Frey, Alessandro Vinciguerra, Sriramkrishnan Muralikrishnan, Sonali, vmontanaro, Mohsen, Andreas Adelmann, manuel5975p, and Felix Schurk. 2024. IPPL-framework/ippl: IPPL-3.2.0. doi:10.5281/ZENODO.10878166

  14. [14]

    Holtrop, Xiao-Long Wu, Fan Lam, Maojing Fu, Justin P

    Jiading Gai, Nady Obeid, Joseph L. Holtrop, Xiao-Long Wu, Fan Lam, Maojing Fu, Justin P. Haldar, Wen-mei W. Hwu, Zhi-Pei Liang, and Bradley P. Sutton

  15. [15]

    Parallel and Distrib

    More IMPATIENT: A Gridding-Accelerated Toeplitz-based Strategy for Non-Cartesian High-Resolution 3D MRI on GPUs.J. Parallel and Distrib. Comput. 73, 5 (May 2013), 686–697. doi:10.1016/j.jpdc.2013.01.001

  16. [16]

    Leslie Greengard and June-Yub Lee. 2004. Accelerating the Nonuniform Fast Fourier Transform.SIAM Rev.46, 3 (Jan. 2004), 443–454. doi:10.1137/ S003614450343200X

  17. [17]

    Hockney and J

    Roger W. Hockney and J. W. Eastwood. 1981.Computer Simulation using particles. McGraw-Hill, New York, NY

  18. [18]

    James F. Kaiser. 1966. Digital Filters. InSystem Analysis by Digital Computer, Franklin F. Kuo and James F. Kaiser (Eds.). John Wiley and Sons, New York, Chapter 7, 218–285

  19. [19]

    Jens Keiner, Stefan Kunis, and Daniel Potts. 2009. Using NFFT 3—A Software Library for Various Nonequispaced Fast Fourier Transforms.ACM Trans. Math. Software36, 4 (Aug. 2009), 1–30. doi:10.1145/1555386.1555388

  20. [20]

    Sodickson

    Florian Knoll, Andreas Schwarzl, Clemens Diwoky, and Daniel K. Sodickson

  21. [21]

    InProceedings of the International Society for Magnetic Resonance in Medicine(Milan, Italy), Vol

    gpuNUFFT – An Open Source GPU Library for 3D Regridding with Direct Matlab Interface. InProceedings of the International Society for Magnetic Resonance in Medicine(Milan, Italy), Vol. 22. International Society for Magnetic Resonance in Medicine, Berkeley, CA, USA, 4297. https://archive.ismrm.org/2014/4297.html

  22. [22]

    Susanne Kunis and Stefan Kunis. 2012. The nonequispaced FFT on graphics processing units.PAMM12, 1 (Dec. 2012), 7–10. doi:10.1002/pamm.201210003

  23. [23]

    Jyh-Miin Lin. 2018. Python Non-Uniform Fast Fourier Transform (PyNUFFT): An Accelerated Non-Cartesian MRI Package on a Heterogeneous Platform (CPU/GPU).Journal of Imaging4, 3 (March 2018), 51. doi:10.3390/ jimaging4030051

  24. [24]

    J. Markel. 1971. FFT pruning.IEEE Transactions on Audio and Electroacoustics19, 4 (Dec. 1971), 305–311. doi:10.1109/TAU.1971.1162205

  25. [25]

    Mitchell, Matthew T

    Matthew S. Mitchell, Matthew T. Miecnikowski, Gregory Beylkin, and Scott E. Parker. 2019. Efficient Fourier basis particle simulation.J. Comput. Phys.396 (Nov. 2019), 837–847. doi:10.1016/j.jcp.2019.07.023

  26. [26]

    Guy M. Morton. 1966.A Computer Oriented Geodetic Data Base and a New Tech- nique in File Sequencing. Technical Report. IBM Ltd., Ottawa, Ontario, Canada

  27. [27]

    M. J. Muckley, R. Stern, T. Murrell, and F. Knoll. 2020. TorchKbNufft: A High- Level, Hardware-Agnostic Non-Uniform Fast Fourier Transform. ISMRM Work- shop on Data Sampling & Image Reconstruction. https://mmuckley.github. io/assets/publications/2020muckleytorchkbnufft.pdf Source code available at https://github.com/mmuckley/torchkbnufft

  28. [28]

    Cerfon, Miroslav Stoyanov, Rahulkumar Gayatri, and Andreas Adelmann

    Sriramkrishnan Muralikrishnan, Matthias Frey, Alessandro Vinciguerra, Michael Ligotino, Antoine J. Cerfon, Miroslav Stoyanov, Rahulkumar Gayatri, and Andreas Adelmann. 2024. Scaling and performance portability of the particle-in-cell scheme for plasma physics applications through mini-apps targeting exascale architectures. InProceedings of the 2024 SIAM C...

  29. [29]

    Sriramkrishnan Muralikrishnan and Robert Speck. 2025. Error Analysis and Parallel Scaling Study of a Parareal Parallel-in-Time Integration Algorithm for Particle-in-Fourier Schemes.SIAM Journal on Scientific Computing(Oct. 2025), S311–S336. doi:10.1137/24M1673097

  30. [30]

    2026.NVIDIA Hopper Tuning Guide

    NVIDIA Corporation. 2026.NVIDIA Hopper Tuning Guide. NVIDIA Corporation. https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Last updated April 2, 2026

  31. [31]

    2013.Prolate Spheroidal Wave Functions of Order Zero: Mathematical Tools for Bandlimited Approxi- mation

    Andrei Osipov, Vladimir Rokhlin, and Hong Xiao. 2013.Prolate Spheroidal Wave Functions of Order Zero: Mathematical Tools for Bandlimited Approxi- mation. Applied Mathematical Sciences, Vol. 187. Springer US, Boston, MA. doi:10.1007/978-1-4614-8259-8

  32. [32]

    2016.Massively Parallel, Fast Fourier Transforms and Particle-Mesh Methods

    Michael Pippig. 2016.Massively Parallel, Fast Fourier Transforms and Particle-Mesh Methods. PhD thesis. Technische Universität Chemnitz, Chemnitz, Germany

  33. [33]

    Michael Pippig and Daniel Potts. 2013. Parallel Three-Dimensional Noneq- uispaced Fast Fourier Transforms and Their Application to Particle Simula- tion.SIAM Journal on Scientific Computing35, 4 (Jan. 2013), C411–C437. doi:10.1137/120888478

  34. [34]

    2024.NonuniformFFTs.jl

    Juan Ignacio Polanco. 2024.NonuniformFFTs.jl. doi:10.5281/zenodo.14637606

  35. [35]

    Daniel Potts, Gabriele Steidl, and M. Tasche. 2001. Fast Fourier transforms for nonequispaced data: a tutorial.Mod. Sampl. theory23 (Jan. 2001), 19–25

  36. [36]

    Daniel Potts and Manfred Tasche. 2021. Continuous window functions for NFFT. Advances in Computational Mathematics47, 4 (Aug. 2021), 53. doi:10.1007/s10444- 021-09873-8

  37. [37]

    Daniel Potts and Manfred Tasche. 2021. Uniform error estimates for nonequi- spaced fast Fourier transforms.Sampling Theory, Signal Processing, and Data Analysis19, 2 (Dec. 2021), 17. doi:10.1007/s43670-021-00017-z

  38. [38]

    Changxiao Nigel Shen, Antoine Cerfon, and Sriramkrishnan Muralikrishnan

  39. [39]

    A particle-in-Fourier method with semi-discrete energy conservation for non-periodic boundary conditions.J. Comput. Phys.519 (Dec. 2024), 113390. doi:10.1016/j.jcp.2024.113390

  40. [40]

    Yu-hsuan Shih, Garrett Wright, Joakim Anden, Johannes Blaschke, and Alex H. Barnett. 2021. cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs. In2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, Portland, OR, USA, 688–697. doi:10.1109/ IPDPSW52791.2021.00105

  41. [41]

    H. V. Sorensen and C. S. Burrus. 1993. Efficient computation of the DFT with only a subset of input or output points.IEEE Transactions on Signal Processing41, 3 (March 1993), 1184–1200. doi:10.1109/78.205723

  42. [42]

    Jingzhu Yang, Chaolu Feng, and Dazhe Zhao. 2013. A CUDA-based reverse gridding algorithm for MR reconstruction.Magnetic Resonance Imaging31, 2 (Feb. 2013), 313–323. doi:10.1016/j.mri.2012.06.038