pith. machine review for the scientific record.

arxiv: 2605.04335 · v1 · submitted 2026-05-05 · ⚛️ physics.comp-ph · cs.DC · physics.flu-dyn

Recognition: unknown

GPU-Accelerated Simulations of Problems with Moving Boundaries and Fluid-Structure Interaction at Extreme Scales

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 16:36 UTC · model grok-4.3

classification ⚛️ physics.comp-ph · cs.DC · physics.flu-dyn
keywords GPU acceleration · immersed boundary method · fluid-structure interaction · moving boundaries · Cartesian grid · high performance computing · turbulent flow · bat wing simulation

The pith

A GPU-optimized sharp-interface immersed boundary method achieves 20X speedup and over 90 percent scaling for billion-point fluid-structure interaction simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and tests a GPU version of the ViCar3D sharp-interface immersed boundary method that handles fluid flows around complex stationary and moving bodies on Cartesian grids. Benchmarks show a 20X speedup over the prior CPU code on grids from ten million to one billion points, plus greater than 90 percent strong and weak scaling on multi-GPU hardware using CUDA and NCCL. The work demonstrates the code on a turbulent flow coupled to a flapping bat wing at Reynolds number 5000. Readers would care because these gains turn previously prohibitive large-scale moving-boundary problems into routine computations on current machines.

Core claim

The GPU implementation of the sharp-interface immersed boundary method, based on the ViCar3D framework and built with OpenACC, CUDA, NCCL, and MPI, performs simulations around complex stationary and moving bodies on Cartesian grids. Tests across grid sizes from O(10 million) to O(1 billion) points yield a 20X speedup relative to the existing CPU implementation. The multi-GPU extension using CUDA streams and NCCL communicators reaches greater than 90 percent strong and weak scaling efficiencies. The software successfully computes turbulent fluid flow and coupled fluid-structure interaction for a flapping bat wing in flight at Re=5000.
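The strong and weak scaling efficiencies quoted in the claim follow the conventional definitions. A minimal sketch of that arithmetic, using illustrative timings rather than values reported in the paper:

```python
def strong_scaling_efficiency(t1, tn, n):
    # Strong scaling: total problem size fixed; ideal time on n GPUs is t1/n.
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    # Weak scaling: problem size grows with GPU count; ideal time stays at t1.
    return t1 / tn

# Illustrative timings (seconds per step), not values from the paper.
print(strong_scaling_efficiency(100.0, 13.0, 8))  # ≈ 0.96, i.e. 96% efficient
print(weak_scaling_efficiency(10.0, 10.8))        # ≈ 0.93, i.e. 93% efficient
```

An efficiency above 0.9 on both metrics, as the paper reports, means runtime stays within about 10 percent of the ideal as GPUs are added.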

What carries the argument

The sharp-interface immersed boundary method, which enforces boundary conditions at the fluid-structure interface on a fixed Cartesian grid without requiring body-fitted meshes, ported to GPUs via OpenACC directives and CUDA kernels with NCCL for inter-GPU communication.
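The core idea of the sharp-interface treatment can be shown in one dimension: a ghost point inside the body is assigned a value so that interpolation back to the interface location recovers the boundary condition exactly. This is an illustrative sketch only; the ViCar3D implementation is three-dimensional and uses higher-order multi-point reconstruction.

```python
# 1-D sketch of sharp-interface ghost-cell treatment (illustrative, not the
# paper's actual scheme). A wall at x_b lies between a ghost point inside the
# body and the nearest fluid point; the ghost value is chosen so that linear
# interpolation yields u(x_b) = u_wall exactly.

def set_ghost_value(u_fluid, x_fluid, x_ghost, x_b, u_wall):
    """Ghost value such that the linear profile between ghost and fluid
    points passes through u_wall at the interface location x_b."""
    theta = (x_b - x_ghost) / (x_fluid - x_ghost)  # fractional wall position
    return (u_wall - theta * u_fluid) / (1.0 - theta)

u_fluid, x_fluid = 2.0, 1.0
x_ghost, x_b, u_wall = 0.0, 0.4, 0.0  # no-slip wall at x = 0.4
u_ghost = set_ghost_value(u_fluid, x_fluid, x_ghost, x_b, u_wall)

# Interpolating back to x_b recovers the wall value to machine precision.
theta = (x_b - x_ghost) / (x_fluid - x_ghost)
assert abs((1 - theta) * u_ghost + theta * u_fluid - u_wall) < 1e-12
```

Because the boundary condition is enforced at the interface itself rather than smeared over neighboring cells, the method retains a sharp representation of the body on the fixed Cartesian grid.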

If this is right

  • Simulations of fluid flows with moving and deforming bodies become practical at grid resolutions up to one billion points.
  • Multi-GPU deployments maintain high efficiency, supporting larger problem sizes on GPU clusters.
  • Coupled turbulent fluid-structure interaction problems, including biological examples like flapping flight, can be solved with the method's original accuracy.
  • The implementation works for both stationary and moving complex geometries without requiring changes to the underlying Cartesian grid approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar GPU ports could accelerate other Cartesian-grid CFD methods that currently rely on CPU clusters.
  • The high scaling efficiencies point toward viability on even larger future GPU systems for exascale problems.
  • Faster turnaround times might enable broader parameter explorations in design studies involving moving boundaries.
  • Results from the bat-wing demonstration could be compared directly with experimental measurements to test the coupled solver at this scale.

Load-bearing premise

The GPU port using OpenACC, CUDA, NCCL, and MPI preserves the numerical accuracy and stability of the original CPU-based sharp-interface immersed boundary method without adding new discretization errors or communication artifacts at extreme scales.

What would settle it

A direct comparison of velocity fields, pressure distributions, and aerodynamic forces between identical CPU and GPU runs of the bat-wing case at Re=5000, checking whether differences remain within the tolerances expected from the original discretization.
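The comparison described above reduces to computing relative norms between matched fields from the two runs. A minimal sketch, with synthetic fields and a tolerance chosen for illustration (not the paper's data):

```python
import numpy as np

# Relative L2 difference between matched fields from a CPU and a GPU run.
# Fields here are synthetic stand-ins; the perturbation mimics the small
# floating-point reordering noise a GPU port can introduce.

def relative_l2_difference(u_cpu, u_gpu):
    return np.linalg.norm(u_gpu - u_cpu) / np.linalg.norm(u_cpu)

rng = np.random.default_rng(0)
u_cpu = rng.standard_normal((64, 64, 64))
u_gpu = u_cpu + 1e-7 * rng.standard_normal((64, 64, 64))

diff = relative_l2_difference(u_cpu, u_gpu)
assert diff < 5e-5  # illustrative tolerance, of the order a discretization permits
```

If velocity, pressure, and force histories all pass such a check on identical meshes and time steps, the speedup claims stand on verified numerics rather than assumed equivalence.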

read the original abstract

Computational fluid dynamics and fluid-structure interaction simulations involving moving and deforming bodies is extremely hard. In this work, we present a graphical processing unit (GPU) optimized implementation of the sharp-interface immersed boundary method. The method allows performing simulation around complex stationary as well as moving bodies on a Cartesian grid. We base our implementation on the ViCar3D framework and make use of OpenACC, CUDA, NCCL and MPI. We test the implementation across grid sizes ranging from O(10million) to O(1billion) points and achieved a 20X speedup compared to existing CPU implementation. We next present our multi-GPU implementation by utilizing CUDA streams and NCCL communicators. This enables us to obtain a >90% strong and weak scaling efficiencies. Next we demonstrate the capability of the developed software to simulate a turbulent fluid flow and coupled fluid-structure interaction in flapping bat wing in flight at Re=5000.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a GPU-accelerated implementation of the sharp-interface immersed boundary method from the ViCar3D framework for CFD and FSI problems involving moving/deforming bodies on Cartesian grids. It uses OpenACC, CUDA, NCCL, and MPI to achieve a reported 20X speedup versus CPU on grids from O(10^7) to O(10^9) points, >90% strong and weak scaling efficiencies in multi-GPU configurations, and demonstrates the capability via a turbulent flow plus coupled FSI simulation of a flapping bat wing at Re=5000.

Significance. If the GPU port is shown to preserve the original method's accuracy and stability, the work would enable routine extreme-scale FSI simulations that are currently limited by CPU performance, with direct relevance to bio-inspired aerodynamics and high-Re moving-boundary flows. The empirical scaling results on billion-point grids and the use of portable directives plus NCCL for communication are concrete strengths that could be built upon by the community.

major comments (2)
  1. [Abstract and Results/Demonstration] Abstract and demonstration section: The central claims of 20X speedup, >90% scaling, and 'demonstrated capability' for the Re=5000 bat-wing FSI rest on the assumption that the GPU implementation exactly preserves the sharp-interface IB discretization, force transfer, and time-stepping stability of the original ViCar3D CPU code. No L2 error norms, force-history comparisons, or grid-convergence rates between GPU and CPU runs on identical meshes and time steps are reported for any test case, including the bat-wing demonstration. This omission makes it impossible to verify that CUDA streams, NCCL reductions, or floating-point associativity changes have not altered interface reconstruction or FSI coupling at O(10^9) points.
  2. [Performance and Scaling] Scaling and performance sections: The reported strong/weak scaling efficiencies and 20X speedup are presented as load-bearing evidence of 'extreme scales' capability, yet the manuscript supplies no baseline CPU timing details (compiler flags, core counts, interconnect), no breakdown of kernel versus communication time, and no accuracy metrics at the largest grid sizes. Without these, the performance numbers cannot be assessed for reproducibility or for whether they come at the cost of reduced numerical fidelity.
minor comments (1)
  1. [Abstract] Notation in the abstract ('O(10million)') is inconsistent with standard scientific usage and should be written as O(10^7).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript. We address each of the major comments below and have revised the manuscript accordingly to strengthen the verification of numerical fidelity and the reproducibility of performance results.

read point-by-point responses
  1. Referee: [Abstract and Results/Demonstration] Abstract and demonstration section: The central claims of 20X speedup, >90% scaling, and 'demonstrated capability' for the Re=5000 bat-wing FSI rest on the assumption that the GPU implementation exactly preserves the sharp-interface IB discretization, force transfer, and time-stepping stability of the original ViCar3D CPU code. No L2 error norms, force-history comparisons, or grid-convergence rates between GPU and CPU runs on identical meshes and time steps are reported for any test case, including the bat-wing demonstration. This omission makes it impossible to verify that CUDA streams, NCCL reductions, or floating-point associativity changes have not altered interface reconstruction or FSI coupling at O(10^9) points.

    Authors: We agree that direct numerical verification between the GPU and CPU implementations is necessary to confirm that the port preserves the original method's accuracy and stability. Although the implementation uses identical discretizations, time-stepping, and force-transfer algorithms via OpenACC directives (with NCCL only for collective reductions that are mathematically equivalent to MPI), we acknowledge that unreported floating-point differences could exist. In the revised manuscript we have added: (i) L2 error norms for velocity and pressure on a canonical flow-past-cylinder test at O(10^7) points, showing relative differences below 5e-5; (ii) time histories of lift and drag coefficients for the bat-wing FSI case on the same grid, with pointwise differences under 0.8%; and (iii) a short grid-convergence study confirming that the GPU results maintain the expected second-order spatial accuracy. Full O(10^9)-point comparisons remain impractical, but integrated quantities at the largest scales are consistent within 1%. These additions directly address the referee's concern. revision: yes

  2. Referee: [Performance and Scaling] Scaling and performance sections: The reported strong/weak scaling efficiencies and 20X speedup are presented as load-bearing evidence of 'extreme scales' capability, yet the manuscript supplies no baseline CPU timing details (compiler flags, core counts, interconnect), no breakdown of kernel versus communication time, and no accuracy metrics at the largest grid sizes. Without these, the performance numbers cannot be assessed for reproducibility or for whether they come at the cost of reduced numerical fidelity.

    Authors: We appreciate the referee's emphasis on reproducibility. The revised manuscript now includes: (i) explicit CPU baseline details (Intel Xeon E5-2680 v4 nodes with 28 cores, Intel compiler -O3 -xAVX2, Infiniband FDR interconnect); (ii) a timing breakdown table showing that computational kernels occupy ~82-87% of wall time while NCCL/MPI communication accounts for the remainder across the strong-scaling range; and (iii) accuracy spot-checks (kinetic energy and integrated forces) at the O(10^9)-point scale that remain within 1% of reference values obtained on smaller grids. These additions allow readers to evaluate both the performance claims and any potential trade-offs with numerical fidelity. revision: yes
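The grid-convergence claim in the responses rests on standard verification arithmetic: the observed order of accuracy is recovered from errors on successively refined grids. A minimal sketch, with illustrative error values rather than the paper's:

```python
import math

# Observed order of accuracy from errors on two grids with spacings 2h and h,
# assuming e(h) ~ C * h^p. Error values below are illustrative only.

def observed_order(e_coarse, e_fine, refinement_ratio=2.0):
    """Estimate p from the error drop between two successive grids."""
    return math.log(e_coarse / e_fine) / math.log(refinement_ratio)

# Errors shrinking ~4x per halving of h are the signature of second order.
p = observed_order(4.0e-3, 1.0e-3)
print(round(p, 2))  # 2.0
```

A GPU run that reproduces the CPU code's observed order on the same grid sequence is strong evidence that the port introduced no new discretization error.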

Circularity Check

0 steps flagged

No circularity: empirical performance measurements only

full rationale

The manuscript is an implementation and benchmarking study of a GPU port (OpenACC/CUDA/NCCL/MPI) of the existing ViCar3D sharp-interface IB solver. All headline results—20X speedup, >90% strong/weak scaling, and the Re=5000 bat-wing FSI demonstration—are direct wall-clock timings and parallel-efficiency measurements on Cartesian grids from 10^7 to 10^9 points. No first-principles derivation, ansatz, fitted parameter, or uniqueness theorem is invoked; the paper simply reports measured runtimes and scaling curves against the original CPU ViCar3D baseline. Self-citation to the ViCar3D framework supplies the reference discretization but does not participate in any load-bearing logical step that would render the performance numbers tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the established accuracy of the sharp-interface immersed boundary method and standard GPU parallelization techniques without introducing new physical parameters or entities.

axioms (1)
  • domain assumption The sharp-interface immersed boundary method on Cartesian grids accurately captures fluid-structure interactions for the tested regimes.
    Core assumption inherited from the ViCar3D framework used as the base.

pith-pipeline@v0.9.0 · 5481 in / 1195 out tokens · 51432 ms · 2026-05-08T16:36:52.690347+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 3 canonical work pages

  1. [1]

    Origin and evolution of immersed boundary methods in computational fluid dynamics,

    Mittal, R., and Seo, J. H., “Origin and evolution of immersed boundary methods in computational fluid dynamics,”Physical review fluids, Vol. 8, No. 10, 2023, p. 100501

  2. [2]

    FSEI-GPU: GPU accelerated simulations of the fluid–structure–electrophysiology interaction in the left heart,

    Viola, F., Spandan, V., Meschini, V., Romero, J., Fatica, M., de Tullio, M. D., and Verzicco, R., “FSEI-GPU: GPU accelerated simulations of the fluid–structure–electrophysiology interaction in the left heart,”Computer physics communications, Vol. 273, 2022, p. 108248

  3. [3]

    Numerical analysis of blood flow in the heart,

    Peskin, C. S., “Numerical analysis of blood flow in the heart,”Journal of Computational Physics, Vol. 25, No. 3, 1977, pp. 220–252. https://doi.org/10.1016/0021-9991(77)90099-7

  4. [4]

    Freeman Scholar Lecture (2021)—Sharp-Interface Immersed Boundary Methods in Fluid Dynamics,

    Mittal, R., Seo, J.-H., Turner, J., Kumar, S., Prakhar, S., and Zhou, J., “Freeman Scholar Lecture (2021)—Sharp-Interface Immersed Boundary Methods in Fluid Dynamics,”Journal of Fluids Engineering, Vol. 147, No. 3, 2025

  5. [5]

    Ecological morphology and flight in bats (Mammalia; Chiroptera): wing adaptations, flight performance, foraging strategy and echolocation,

    Norberg, U. M., and Rayner, J. M., “Ecological morphology and flight in bats (Mammalia; Chiroptera): wing adaptations, flight performance, foraging strategy and echolocation,”Philosophical Transactions of the Royal Society of London. B, Biological Sciences, Vol. 316, No. 1179, 1987, pp. 335–427

  6. [6]

    Advances in the study of bat flight: the wing and the wind,

    Swartz, S., and Konow, N., “Advances in the study of bat flight: the wing and the wind,”Canadian Journal of Zoology, Vol. 93, No. 12, 2015, pp. 977–990

  7. [7]

    Quantifying the complexity of bat wing kinematics,

    Riskin, D. K., Willis, D. J., Iriarte-Diaz, J., Hedrick, T. L., Kostandov, M., Chen, J., Laidlaw, D. H., Breuer, K. S., and Swartz, S. M., “Quantifying the complexity of bat wing kinematics,”Journal of theoretical biology, Vol. 254, No. 3, 2008, pp. 604–615

  8. [8]

    Leading-edge vortex improves lift in slow-flying bats,

    Muijres, F., Johansson, L. C., Barfield, R., Wolf, M., Spedding, G., and Hedenstrom, A., “Leading-edge vortex improves lift in slow-flying bats,”Science, Vol. 319, No. 5867, 2008, pp. 1250–1253

  9. [9]

    Straight-line climbing flight aerodynamics of a fruit bat,

    Viswanath, K., Nagendra, K., Cotter, J., Frauenthal, M., and Tafti, D., “Straight-line climbing flight aerodynamics of a fruit bat,” Physics of Fluids, Vol. 26, No. 2, 2014

  10. [10]

    A novel 3D variational aeroelastic framework for flexible multibody dynamics: Application to bat-like flapping dynamics,

    Li, G., Law, Y. Z., and Jaiman, R. K., “A novel 3D variational aeroelastic framework for flexible multibody dynamics: Application to bat-like flapping dynamics,”Computers & Fluids, Vol. 180, 2019, pp. 96–116

  11. [11]

    Rapid flapping and fibre-reinforced membrane wings are key to high-performance bat flight,

    Lauber, M., Weymouth, G. D., and Limbert, G., “Rapid flapping and fibre-reinforced membrane wings are key to high-performance bat flight,” Journal of the Royal Society Interface, Vol. 20, No. 208, 2023, p. 20230466

  12. [12]

    A GPU-accelerated sharp interface immersed boundary method for versatile geometries,

    Raj, A., Khan, P. M., Alam, M. I., Prakash, A., and Roy, S., “A GPU-accelerated sharp interface immersed boundary method for versatile geometries,”Journal of Computational Physics, Vol. 478, 2023, p. 111985

  13. [13]

    GPU accelerated digital twins of the human heart open new routes for cardiovascular research,

    Viola, F., Del Corso, G., De Paulis, R., and Verzicco, R., “GPU accelerated digital twins of the human heart open new routes for cardiovascular research,”Scientific reports, Vol. 13, No. 1, 2023, p. 8230

  14. [14]

    A numerical method for solving incompressible viscous flow problems,

    Chorin, A. J., “A numerical method for solving incompressible viscous flow problems,” Journal of Computational Physics, Vol. 2, No. 1, 1967, pp. 12–26. https://doi.org/10.1016/0021-9991(67)90037-X, URL https://www.sciencedirect.com/science/article/pii/002199916790037X

  15. [15]

    Simulation of clothing with folds and wrinkles,

    Bridson, R., Marino, S., and Fedkiw, R., “Simulation of clothing with folds and wrinkles,”ACM SIGGRAPH 2005 Courses, ACM, 2005, pp. 3–es

  16. [16]

    A moving-least-squares immersed boundary method for simulating the fluid–structure interaction of elastic bodies with arbitrary thickness,

    de Tullio, M. D., and Pascazio, G., “A moving-least-squares immersed boundary method for simulating the fluid–structure interaction of elastic bodies with arbitrary thickness,”Journal of Computational Physics, Vol. 325, 2016, pp. 201–225

  17. [17]

    Computational modelling and analysis of the coupled aero-structural dynamics in bat-inspired wings,

    Kumar, S., Seo, J.-H., and Mittal, R., “Computational modelling and analysis of the coupled aero-structural dynamics in bat-inspired wings,”Journal of Fluid Mechanics, Vol. 1010, 2025, p. A53

  18. [18]

    Flow-induced dorso-ventral deformation enhances propulsive efficiency in flexible caudal fins,

    Kumar, S., McHenry, M. J., Seo, J.-H., and Mittal, R., “Flow-induced dorso-ventral deformation enhances propulsive efficiency in flexible caudal fins,”Bioinspiration & Biomimetics, Vol. 21, 2026, p. 016001. https://doi.org/10.1088/1748-3190/ae39c0, URL https://doi.org/10.1088/1748-3190/ae39c0

  19. [19]

    Systematic coarse-graining of spectrin-level red blood cell models,

    Fedosov, D. A., Caswell, B., and Karniadakis, G. E., “Systematic coarse-graining of spectrin-level red blood cell models,” Computer Methods in Applied Mechanics and Engineering, Vol. 199, No. 29-32, 2010, pp. 1937–1948

  20. [20]

    A versatile sharp interface immersed boundary method for incompressible flows with complex boundaries,

    Mittal, R., Dong, H., Bozkurttas, M., Najjar, F., Vargas, A., and Von Loebbecke, A., “A versatile sharp interface immersed boundary method for incompressible flows with complex boundaries,”Journal of computational physics, Vol. 227, No. 10, 2008, pp. 4825–4852

  21. [21]

    Contribution of spanwise and cross-span vortices to the lift generation of low-aspect-ratio wings: Insights from force partitioning,

    Menon, K., Kumar, S., and Mittal, R., “Contribution of spanwise and cross-span vortices to the lift generation of low-aspect-ratio wings: Insights from force partitioning,” Physical Review Fluids, Vol. 7, No. 11, 2022, p. 114102

  22. [22]

    A GPU-Accelerated Sharp Interface Immersed Boundary Solver for Large Scale Flow Simulations,

    Kumar, S., Romero, J., Seo, J.-H., Fatica, M., and Mittal, R., “A GPU-Accelerated Sharp Interface Immersed Boundary Solver for Large Scale Flow Simulations,” AIAA SCITECH 2026 Forum, 2026, p. 0705