arXiv: 2605.16082 · v1 · pith:WWZZXKGKnew · submitted 2026-05-15 · 💻 cs.DC · physics.ao-ph · physics.comp-ph · physics.flu-dyn

An efficient multi-GPU implementation for the Discontinuous Galerkin ocean model SLIM

Pith reviewed 2026-05-19 18:37 UTC · model grok-4.3

classification 💻 cs.DC · physics.ao-ph · physics.comp-ph · physics.flu-dyn
keywords Discontinuous Galerkin · GPU computing · ocean modeling · SLIM · multi-GPU · unstructured mesh · coastal simulation · high-performance computing

The pith

A GPU-optimized Discontinuous Galerkin ocean model achieves the speed of roughly 1500 CPU cores on a single card and scales to 1024 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a complete 3D implementation of the SLIM unstructured-mesh ocean model that uses Discontinuous Galerkin finite elements and is rewritten for both single-GPU and multi-GPU hardware. It reports that one high-end GPU delivers performance comparable to about 1500 CPU cores and that swapping a 128-core CPU node for a four-GPU node produces roughly fifty times the speed. The code keeps high weak-scaling efficiency out to 1024 GPUs and is tested on a Great Barrier Reef domain at five times the spatial resolution of earlier models while running one hundred times faster than real time. This work targets the long-standing barrier that high computational cost has placed on detailed coastal simulations with DG methods.

Core claim

Mapping the DG-FE ocean equations to GPU kernels through optimized memory layouts, element-wise parallelization, and matrix-free treatment of vertical processes produces an implementation that runs efficiently on both NVIDIA and AMD GPUs, maintains weak scaling to 1024 devices, and supports real-world coastal runs at previously unattainable resolution.
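
As a rough illustration of what a GPU-friendly memory layout means in this context, the sketch below compares two ways of flattening a (column, layer, value) index into a linear offset. The sizes (128 columns per cell, 6 values per layer) are taken from the Figure 5 caption; the ordering of the dimensions and the function names `soa_offset` and `aos_offset` are assumptions made for illustration and may not match the paper's actual layout.

```python
# Hypothetical index mappings; sizes follow the Figure 5 caption,
# the dimension ordering is an assumption, not the paper's layout.
COLUMNS_PER_CELL = 128   # columns grouped into one cell
VALUES_PER_LAYER = 6     # nodal values of a scalar field per prism layer


def soa_offset(column, layer, value):
    """Structure-of-Arrays style: the column index varies fastest,
    so neighbouring threads (one per column) touch neighbouring addresses."""
    return (layer * VALUES_PER_LAYER + value) * COLUMNS_PER_CELL + column


def aos_offset(column, layer, value, n_layers):
    """Array-of-Structures style: all values of one column are contiguous,
    so neighbouring threads are separated by a large stride."""
    return (column * n_layers + layer) * VALUES_PER_LAYER + value


n_layers = 2
soa_stride = soa_offset(1, 0, 0) - soa_offset(0, 0, 0)
aos_stride = aos_offset(1, 0, 0, n_layers) - aos_offset(0, 0, 0, n_layers)
print("stride between adjacent columns, SoA:", soa_stride)
print("stride between adjacent columns, AoS:", aos_stride)
```

With one thread per column, a unit stride between adjacent columns is what allows a block of threads to read a contiguous segment of global memory in one transaction, which is the usual motivation for layouts of this kind.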

What carries the argument

GPU kernels for Discontinuous Galerkin finite elements that use matrix-free vertical solvers and distributed multi-GPU communication.
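
To make "matrix-free vertical solver" more concrete, here is a minimal sketch of an implicit vertical diffusion step on a single water column in which the tridiagonal coefficients are generated on the fly and never stored as an assembled matrix. It is a deliberate simplification: it uses a finite-difference stencil with uniform layers, constant diffusivity, zero-flux boundaries and backward Euler, none of which are taken from the paper, whose DG vertical operators are presumably more involved. The function name and parameters are hypothetical.

```python
def implicit_vertical_diffusion(u, kappa, dz, dt):
    """One backward-Euler step of du/dt = d/dz(kappa du/dz) on one column.

    Assumptions (illustrative only): uniform layer thickness dz, constant
    diffusivity kappa, zero-flux top and bottom boundaries.
    The tridiagonal system is solved with the Thomas algorithm; its
    coefficients are computed inside the loop instead of being assembled.
    """
    n = len(u)
    r = kappa * dt / (dz * dz)
    c_prime = [0.0] * n   # modified upper-diagonal coefficients
    d_prime = [0.0] * n   # modified right-hand side

    # Forward elimination, building each matrix row on the fly.
    for k in range(n):
        lower = -r if k > 0 else 0.0
        upper = -r if k < n - 1 else 0.0
        diag = 1.0 - lower - upper
        if k == 0:
            c_prime[k] = upper / diag
            d_prime[k] = u[k] / diag
        else:
            denom = diag - lower * c_prime[k - 1]
            c_prime[k] = upper / denom
            d_prime[k] = (u[k] - lower * d_prime[k - 1]) / denom

    # Back substitution from the bottom layer upward.
    out = [0.0] * n
    out[-1] = d_prime[-1]
    for k in range(n - 2, -1, -1):
        out[k] = d_prime[k] - c_prime[k] * out[k + 1]
    return out


profile = implicit_vertical_diffusion([1.0, 0.0, 0.0, 0.0], kappa=1e-2, dz=1.0, dt=10.0)
print(profile)
```

Because every column is independent, a solve of this kind maps naturally to one GPU thread per column, which is consistent with the element-wise and column-wise parallelism the summary describes.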

If this is right

  • A four-GPU node can replace a 128-core CPU node and deliver about fifty times higher throughput for the same coastal model.
  • Spatial resolution five times finer than current best models becomes feasible while still running faster than real time.
  • Weak scaling that holds to 1024 GPUs opens the door to ensemble forecasts or basin-scale high-resolution studies.
  • The same kernel strategies apply to both NVIDIA and AMD architectures, reducing dependence on a single vendor.
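
The two headline numbers can be checked against each other with simple arithmetic. The sketch below assumes, as a simplification not stated in the paper, that four GPUs in one node scale perfectly and that the 1500-core equivalence applies to each of them; the function name is hypothetical.

```python
def node_speedup(cpu_cores_per_gpu, gpus_per_node, cpu_cores_per_node):
    """Speedup of a GPU node over a CPU node, assuming each GPU is worth
    `cpu_cores_per_gpu` CPU cores and GPUs scale perfectly within the node."""
    return cpu_cores_per_gpu * gpus_per_node / cpu_cores_per_node


# Figures quoted in the abstract: 1 A100 ~ 1500 cores, 4xA100 vs 128 cores.
speedup = node_speedup(1500, 4, 128)
print(f"{speedup:.1f}x")  # 46.9x
```

Under those assumptions the estimate is about 47x, which is consistent with the "around 50x" reported for the node-level comparison.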

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar GPU mappings could be applied to other DG-based fluid models in atmosphere or ice-sheet science.
  • Routine availability of such resolution may improve forecasts of localized coastal hazards such as reef bleaching or storm surge.
  • The approach could be combined with adaptive mesh refinement to focus compute only where needed.
  • Operational centers might adopt GPU clusters to run multiple high-resolution scenarios within the same wall-clock window.

Load-bearing premise

The Discontinuous Galerkin formulation and vertical processes can be turned into GPU kernels whose communication overhead stays low enough that the reported benchmarks remain representative of full production runs.

What would settle it

A timing measurement on the Great Barrier Reef case that shows the physical-to-numerical time ratio falling well below 100 because of unexpected data-transfer costs would falsify the performance claims.
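
For readers unfamiliar with the metric, the physical-to-numerical time ratio is simulated time divided by wall-clock time. The helper below is a small sketch of that definition; the 61-day run length is a hypothetical example and the function names are not from the paper.

```python
def time_ratio(simulated_hours, wall_clock_hours):
    """Physical-to-numerical time ratio: simulated time per unit wall time."""
    return simulated_hours / wall_clock_hours


def wall_clock_hours(simulated_days, ratio):
    """Wall-clock hours needed to simulate `simulated_days` at a given ratio."""
    return simulated_days * 24.0 / ratio


# Hypothetical 61-day simulation at the reported ratio of 100.
hours_needed = wall_clock_hours(61, 100)
print(f"{hours_needed:.2f} h of wall-clock time")

# If extra data-transfer cost doubled the wall time, the ratio would halve.
degraded = time_ratio(61 * 24.0, 2 * hours_needed)
print(f"ratio with doubled wall time: {degraded:.0f}")
```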

Figures

Figures reproduced from arXiv: 2605.16082 by Ange P. Ishimwe, Colin Scherpereel, Emmanuel Hanert, Jonathan Lambrechts, Miguel De Le Court, Vincent Legat.

Figure 1: Examples of meshes used in SLIM. The unstructured 2D triangular mesh is extruded vertically
Figure 2: (a) Schematic view of the five main components of a time step. The ordering shown corresponds
Figure 3: Structure-of-Arrays (SoA) memory layout used in our implementation. Prisms within a column
Figure 4: Example of a cell layout for the same mesh as Figure
Figure 5: Example of a cell matrix with two layers and 128 columns for a scalar field (6 values per layer).
Figure 6: Subset of a cell processed by a block of 128 threads. This example shows a cell with 128
Figure 7: Kernels used for the computation of the horizontal terms of the momentum equation.
Figure 8: Kernels of the 2D external mode during a full step of the scheme. In this example, the external
Figure 9: Kernels used for the computation of the vertical terms of the momentum equation during an
Figure 10: Kernels used for the computation of the vertical terms of the momentum equation during an
Figure 11: Timeline of both the Compute stream and Communications stream for the two main phases
Figure 12: Timeline of both the Compute stream and Communications stream for 3 iterations of the 2D
Figure 13: Performance of the 3D model on various hardware platforms with 32 layers and increasing
Figure 14: Global memory bandwidth and floating-point throughput as a percent of peak over a complete
Figure 15: Normalized time per iteration of the 3D scheme as a function of the number of layers for an
Figure 16: Scaling of the 3D model with 32 layers on the MeluXina cluster with A100 GPUs.
Figure 17: Efficiency of the 3D model with 32 layers on the MeluXina cluster with A100 GPUs.
Figure 18: Scaling of the 3D model with 32 layers on the LUMI cluster equipped with MI250X GPUs.
Figure 19: Computational mesh for the Great Barrier Reef configuration. The horizontal resolution
Figure 20: Surface vertical vorticity at increasing levels of zoom in the Great Barrier Reef domain. The
Figure 21: Modelled sea surface temperature on October 31, 2024 at 10:00, after two months of sim
Original abstract

Unstructured-mesh ocean models are increasingly used for coastal applications due to their ability to represent complex geometries and apply local grid refinement where needed. However, their broader use has been hindered by their high computational cost, particularly for models based on the Discontinuous Galerkin finite element (DG-FE) method, which involves significantly more degrees of freedom than traditional finite volume or continuous finite element approaches. The rapid emergence of GPU-based high-performance computing architectures now offers a pathway to address this limitation, as DG-FE formulations are inherently well suited to massively parallel, element-wise computations. Here, we present a full 3D DG-FE ocean model implementation optimized for both single- and multi-GPU systems, with support for both NVIDIA and AMD architectures. We detail the computational strategies employed to achieve high performance, including memory layout optimization, kernel-level parallelization, and matrix-free solvers for key vertical processes. Benchmark results demonstrate that a single HPC-grade GPU (e.g. NVIDIA A100) delivers performance equivalent to approximately 1500 CPU cores, while replacing a 128-core CPU node with a 4xA100 GPU node yields a speedup of around 50x. Weak-scaling efficiency is maintained up to 1024 GPUs. We further demonstrate the model's capabilities on a real-world application in the Great Barrier Reef, achieving a spatial resolution five times finer than the most accurate existing model while maintaining a physical-to-numerical time ratio of 100. These results highlight how GPU-accelerated DG-FE methods can dramatically advance the capabilities of unstructured-mesh ocean modeling, enabling ultra-high-resolution coastal simulations that were previously infeasible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper presents an efficient implementation of the Discontinuous Galerkin finite element (DG-FE) ocean model SLIM for multi-GPU systems, including optimizations for memory layout, kernel parallelization, and matrix-free solvers. It reports performance benchmarks showing a single NVIDIA A100 GPU equivalent to approximately 1500 CPU cores, a 50x speedup when replacing a 128-core CPU node with a 4xA100 GPU node, maintained weak-scaling efficiency up to 1024 GPUs, and a real-world application to the Great Barrier Reef achieving five times finer spatial resolution with a physical-to-numerical time ratio of 100.

Significance. Should the reported performance and scaling results prove robust, this work would be significant for the field of computational ocean modeling. It demonstrates how GPU acceleration can overcome the high computational cost of DG-FE methods on unstructured meshes, potentially enabling ultra-high-resolution simulations of complex coastal environments that were not feasible before.

major comments (1)
  1. [Weak scaling results] The claim that weak-scaling efficiency is maintained up to 1024 GPUs is central to the multi-GPU contribution. However, the manuscript does not provide a breakdown of the fraction of wall time spent on inter-GPU communication versus computation at large scales. On unstructured meshes with refinement, such as the Great Barrier Reef application, halo exchange volumes can be irregular and substantial; without profiling data showing that communication remains a small percentage of total time, the scaling efficiency cannot be confidently assessed.
minor comments (2)
  1. [Benchmark description] The performance equivalence of one A100 to 1500 CPU cores and the 50x node speedup lack accompanying error bars, details on the exact CPU configuration (e.g., core count per node, processor type), and verification that all overheads are accounted for in the 1500-core equivalence.
  2. [Abstract] The abstract mentions support for both NVIDIA and AMD architectures but provides no specific performance numbers for AMD GPUs, which would help assess portability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of our multi-GPU DG-FE implementation. We address the major comment below and will revise the manuscript to incorporate additional profiling data.

Point-by-point responses
  1. Referee: [Weak scaling results] The claim that weak-scaling efficiency is maintained up to 1024 GPUs is central to the multi-GPU contribution. However, the manuscript does not provide a breakdown of the fraction of wall time spent on inter-GPU communication versus computation at large scales. On unstructured meshes with refinement, such as the Great Barrier Reef application, halo exchange volumes can be irregular and substantial; without profiling data showing that communication remains a small percentage of total time, the scaling efficiency cannot be confidently assessed.

    Authors: We appreciate the referee's point that a communication-versus-computation breakdown would strengthen the scaling claims, particularly for the irregular halo exchanges that arise on locally refined unstructured meshes. The reported weak-scaling efficiencies are derived from full wall-clock timings that already include all inter-GPU communication; however, we agree that explicit profiling data would allow readers to assess the overhead more directly. In the revised manuscript we will add a new figure and accompanying text that report the measured fraction of wall time spent on halo exchanges (via CUDA-aware MPI or equivalent) at representative scale points up to 1024 GPUs. Where possible we will also include the corresponding breakdown for the Great Barrier Reef configuration. revision: yes
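
The breakdown discussed in this exchange can be expressed with two simple quantities: weak-scaling efficiency (reference time divided by time at N GPUs, with fixed work per GPU) and the fraction of wall time in which communication is not hidden behind computation. The sketch below uses made-up timings purely to show the calculation; none of the numbers are measurements from the paper, and the function names are hypothetical.

```python
def weak_scaling_efficiency(t_reference, t_n):
    """Weak-scaling efficiency: time per step on the reference GPU count
    divided by time per step on N GPUs, with constant work per GPU."""
    return t_reference / t_n


def exposed_comm_fraction(t_total, t_compute):
    """Share of wall time not covered by computation, i.e. communication
    that could not be overlapped with kernels."""
    return max(0.0, t_total - t_compute) / t_total


# Hypothetical per-step timings in seconds: {gpus: (total, compute_only)}.
timings = {1: (1.00, 1.00), 64: (1.04, 1.00), 1024: (1.10, 1.00)}

t_ref = timings[1][0]
for gpus, (t_total, t_compute) in sorted(timings.items()):
    eff = weak_scaling_efficiency(t_ref, t_total)
    comm = exposed_comm_fraction(t_total, t_compute)
    print(f"{gpus:5d} GPUs  efficiency={eff:.3f}  exposed comm={comm:.3f}")
```

Reporting both numbers side by side at each scale point would show whether any loss of efficiency is attributable to halo exchange or to something else, such as load imbalance.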

Circularity Check

0 steps flagged

No circularity: performance claims are direct experimental measurements

Full rationale

This is an implementation and benchmarking paper whose central results consist of measured wall-clock times, speedups, and weak-scaling efficiencies obtained on actual GPU hardware. The reported equivalences (single A100 ≈ 1500 CPU cores, 50× node speedup, scaling to 1024 GPUs) and the Great Barrier Reef run metrics are direct outcomes of the described kernels and MPI/GPU-direct exchanges; they are not obtained by fitting parameters to a subset of the same data and then re-deriving the same quantities, nor by self-definitional equations or load-bearing self-citations. The derivation chain is therefore self-contained against external hardware benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an engineering implementation of an established DG-FE ocean model; it introduces no new physical axioms or invented entities and relies on standard parallel computing practices and existing ocean-model assumptions.

axioms (1)
  • Domain assumption: the standard assumptions of the Discontinuous Galerkin finite-element discretization for shallow-water and 3D ocean equations remain valid on GPU architectures.
    The implementation inherits the mathematical formulation of the original SLIM model without re-deriving or altering its governing equations.

pith-pipeline@v0.9.0 · 5861 in / 1442 out tokens · 48873 ms · 2026-05-19T18:37:09.946944+00:00 · methodology

