pith. machine review for the scientific record.

arxiv: 2604.18020 · v1 · submitted 2026-04-20 · 💻 cs.CE · cs.DC

Recognition: unknown

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

Jun Wang, Shaoliang Yang, Yunsheng Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:49 UTC · model grok-4.3

classification 💻 cs.CE cs.DC
keywords: SIMP topology optimization · matrix-free finite elements · fused CUDA kernel · GPU acceleration · gather-GEMM-scatter · 3D structural optimization · computational engineering

The pith

A single fused CUDA kernel performs gather, batched GEMM and scatter in one pass to accelerate matrix-free 3D SIMP topology optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a fused CUDA kernel that combines the gather, per-element stiffness multiplication, and scatter accumulation steps into a single kernel launch for three-dimensional SIMP topology optimization. This design removes the global stiffness matrix assembly and the intermediate DRAM writes required by the conventional three-stage implementation. On an RTX 4090 GPU the fused approach produces measured wall-time speedups between 4.6x and 7.3x for cantilever problems ranging from 216k to 4.9M elements and 4.4x on a 499k-element torsion case, while board-power traces show 3.2-4.9x lower energy use than matched FP64 runs. The same kernel also yields 8.9-13.8x isolated speedups per matrix-vector product while preserving the matrix-free character of the underlying finite-element scheme.
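
To make the staged pattern concrete, here is a minimal sketch of the conventional three-stage matvec in CuPy; the array names and shapes (dofmap, Ke, rho, u for an 8-node hexahedral mesh with 24 DOFs per element) are illustrative assumptions, not the paper's code. Stages 1 and 2 each materialize an (n_e, 24) intermediate in device DRAM, which is exactly the traffic the fused kernel removes.

```python
import cupy as cp
import cupyx

# Hypothetical layout: n_e elements, 24 DOFs per 8-node hex element.
#   dofmap: (n_e, 24) int32, element-to-global DOF indices
#   Ke:     (24, 24)  float32, reference element stiffness matrix
#   rho:    (n_e,)    float32, SIMP densities (scaling simplified to rho**p)
#   u:      (n_dof,)  float32, global displacement-like vector

def three_stage_matvec(dofmap, Ke, rho, u, n_dof, p=3.0):
    u_e = u[dofmap]                            # stage 1: gather (writes (n_e, 24) to DRAM)
    f_e = (rho[:, None] ** p) * (u_e @ Ke.T)   # stage 2: batched GEMM (another round trip)
    f = cp.zeros(n_dof, dtype=u.dtype)
    cupyx.scatter_add(f, dofmap.ravel(), f_e.ravel())   # stage 3: scatter-accumulate
    return f
```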

Core claim

By implementing the entire gather-batched-GEMM-scatter sequence inside one CuPy-compiled CUDA kernel, the method eliminates intermediate global memory traffic between the three conventional stages, delivering end-to-end SIMP wall-time reductions of 4.6-7.3x on cantilever benchmarks and 4.4x on the torsion benchmark when executed on a single RTX 4090, while separate power traces show 3.2-4.9x lower energy use relative to matched FP64 three-stage runs.

What carries the argument

The fused gather-GEMM-scatter CUDA kernel that executes the complete per-element matrix-vector product and accumulation in a single kernel launch, thereby removing all intermediate DRAM traffic between gather, multiplication and scatter stages.
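
A compressed sketch of what a kernel of this shape could look like through CuPy's runtime compilation interface; the kernel body, launch shape, and simplified SIMP scaling are editorial assumptions, not the authors' implementation. One thread handles one element end to end, so the per-element vectors live in registers instead of DRAM.

```python
import numpy as np
import cupy as cp

fused_src = r'''
extern "C" __global__
void fused_matvec(const int*   __restrict__ dofmap,  // (n_e, 24)
                  const float* __restrict__ Ke,      // (24, 24), row-major
                  const float* __restrict__ rho,     // (n_e,)
                  const float* __restrict__ u,       // (n_dof,)
                  float*       __restrict__ f,       // (n_dof,), zero-initialized
                  int n_e, float p) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_e) return;
    int   gd[24];
    float ue[24];
    for (int i = 0; i < 24; ++i) {        // gather straight into registers
        gd[i] = dofmap[e * 24 + i];
        ue[i] = u[gd[i]];
    }
    float s = powf(rho[e], p);            // simplified SIMP stiffness scaling
    for (int i = 0; i < 24; ++i) {        // per-element multiply ...
        float acc = 0.0f;
        for (int j = 0; j < 24; ++j)
            acc += Ke[i * 24 + j] * ue[j];
        atomicAdd(&f[gd[i]], s * acc);    // ... and scatter, with no intermediates
    }
}
'''
fused_matvec = cp.RawKernel(fused_src, 'fused_matvec')

# Launch, one thread per element (usage sketch):
#   fused_matvec(((n_e + 255) // 256,), (256,),
#                (dofmap, Ke, rho, u, f, np.int32(n_e), np.float32(3.0)))
```

Note the atomicAdd: the accumulation order is nondeterministic across threads, which is the non-associativity caveat the referee report below turns into its first major comment.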

If this is right

  • End-to-end SIMP wall time drops by 4.6-7.3x on cantilever problems with 216k to 4.9M elements and by 4.4x on the 499k-element torsion benchmark.
  • Isolated operator calls achieve 8.9-13.8x speedup when measured with CUDA events.
  • Board-level power traces indicate 3.2-4.9x lower energy consumption at 216k and 1M element sizes relative to FP64 three-stage runs.
  • A separate bridge model under distributed load reproduces the same FP32-versus-FP64 trend observed in the cantilever tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reported speedups imply that single-consumer-GPU runs of industrial-scale 3D topology optimization become practical without cluster resources.
  • The BF16 proxy speedup of 14.3x combined with condition-number estimates of 6.1e5-2.3e6 indicates that mixed-precision variants require additional stabilization before they can replace FP32 in iterative solvers.
  • Because the fusion targets only the matvec kernel, the same pattern can be inserted into any matrix-free finite-element loop that currently uses separate gather-multiply-scatter stages.

Load-bearing premise

The fused kernel produces results that are numerically equivalent to the three-stage baseline, yielding the same convergence behavior and final designs.

What would settle it

Running both the fused kernel and the three-stage baseline on the same cantilever or torsion mesh and comparing the full objective-function histories plus the final density fields would confirm or refute numerical equivalence.
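
The matvec half of that test is a few lines; apply_three_stage and apply_fused are hypothetical callables standing in for the two operator implementations, and the full check would additionally log objective values and density fields across the SIMP iterations.

```python
import cupy as cp

def check_matvec_equivalence(apply_three_stage, apply_fused, rho, n_dof,
                             trials=10, seed=0):
    """Worst-case relative L2 difference between the two operators over random inputs."""
    rng = cp.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        u = rng.standard_normal(n_dof, dtype=cp.float32)
        ref = apply_three_stage(rho, u)
        alt = apply_fused(rho, u)
        rel = float(cp.linalg.norm(alt - ref) / cp.linalg.norm(ref))
        worst = max(worst, rel)
    return worst   # judge against an FP32-commensurate tolerance, e.g. ~1e-6
```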

Figures

Figures reproduced from arXiv: 2604.18020 by Jun Wang, Shaoliang Yang, Yunsheng Wang.

Figure 1: Comparison of the three-stage baseline pipeline (left) and the fused single-kernel …
Figure 2: BF16 WMMA tensor-core tiling for the per-element stiffness multiply.
Figure 3: Synthetic hot-path timing breakdown at four element counts. Bars show the FP64 …
Figure 4: Cold-start FEA solve wall time versus element count for the three cantilever solver …
Figure 5: End-to-end SIMP-120 wall time versus number of elements for the cantilever bench…
Figure 6: Baseline-to-fused SIMP-120 speedup at five element counts. The blue bars report …
Figure 7: Wall-time comparison against local CPU/PyPardiso PyTopo3D reruns …
Figure 8: Selected cantilever topology, 120×60×30 = 216,000 elements (selected design iterate from a separate same-configuration fused FP32 SIMP-120 rerun, Vf = 0.30; smoothed marching-cubes isosurface at ρ = 0.5 with 80-step Laplacian smoothing, relaxation factor 0.08). Left: side-profile view (XY plane) showing the classic arrowhead-arch truss with two oval cutouts. Right: isometric view revealing the three-dimens…
Figure 9: Selected cantilever topology at 200×100×50 = 1,000,000 elements (selected design iterate from a separate same-configuration fused FP32 SIMP-120 rerun, Vf = 0.30; smoothed marching-cubes isosurface at ρ = 0.5 with 80-step Laplacian smoothing, relaxation factor 0.08). Left: side-profile view. Right: isometric view. The higher resolution resolves a cleaner arrowhead arch at the same qualitative design fam…
Figure 10: SIMP-120 convergence history for the 499,125-element torsion benchmark.
Figure 11: Selected topology for the 499,125-element torsion shaft benchmark (selected design …
Figure 12: Separate qualitative exemplar from the bridge-family load class: selected topology for …
Figure 13: Representative CG residual decay for a cold-start linear solve at uniform density …
Figure 14: Centered-patch cantilever topology from a separate fused-FP32 SIMP rerun outside …
original abstract

The matrix-free gather-batched-GEMM-scatter pattern eliminates global stiffness assembly for three-dimensional SIMP topology optimization, but the conventional three-stage implementation forces avoidable DRAM traffic between stages. We present a single fused CUDA kernel, implemented through CuPy's runtime compilation interface, that performs gather, per-element stiffness multiplication, and scatter accumulation in one pass. On a single RTX 4090 (24 GB), the fused path reaches a problem-size-dependent 4.6-7.3x end-to-end SIMP wall-time speedup across 216k-4.9M cantilever elements and 4.4x on the 499,125-element torsion benchmark. Against the same-precision FP32 three-stage baseline, the fused path still yields 2.3-4.6x on cantilever and 2.8x on torsion. Isolated CUDA-event cantilever-operator measurements reach 8.9-13.8x per matvec call, while separate instrumented board-power traces at 216k and 1M show 3.2-4.9x lower energy than matched FP64 runs. A separate bridge stress test shows the same FP32-versus-FP64 three-stage trend under one distributed-load case; direct fused-kernel bridge benchmarks are not reported. We also evaluate a BF16 WMMA variant: a separate PyTorch BF16 GEMM proxy on matching tensor shapes yields 14.3x, but direct condition-number estimates of 6.1e5-2.3e6 across 64k-512k uniform-density test states imply BF16 conditioning products of 2.4e3-9.1e3, far above the 256 threshold, observed alongside BF16 iterative-refinement stagnation at the two tested inner tolerances.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a single fused CUDA kernel (via CuPy runtime compilation) that combines gather, per-element stiffness multiplication (batched GEMM), and scatter accumulation for matrix-free 3D SIMP topology optimization. This eliminates intermediate DRAM traffic from the conventional three-stage gather-GEMM-scatter sequence. On an RTX 4090, the fused implementation is reported to deliver 4.6-7.3x end-to-end wall-time speedups for cantilever problems (216k-4.9M elements) and 4.4x on a 499k-element torsion benchmark, with 2.3-4.6x gains even versus an FP32 three-stage baseline; isolated matvec speedups reach 8.9-13.8x, and energy reductions of 3.2-4.9x versus FP64 are shown via board-power traces. A BF16 WMMA variant is also evaluated via proxy, with conditioning analysis indicating potential numerical issues.

Significance. If numerical equivalence to the baseline is confirmed, the work provides a concrete demonstration of how kernel fusion can substantially accelerate memory-bound matrix-free topology optimization on modern GPUs, with reproducible speedups and energy metrics on specific large-scale 3D benchmarks. The empirical focus on end-to-end SIMP iterations, power instrumentation, and problem-size scaling offers practical value for computational mechanics applications, though the BF16 results underscore precision trade-offs.

major comments (2)
  1. [Results] The headline performance claims (4.6-7.3x end-to-end, 8.9-13.8x per matvec) are load-bearing on the assumption that the fused gather-GEMM-scatter kernel produces FP32 results numerically identical to the three-stage baseline. The manuscript reports wall-clock times, power traces, and BF16 conditioning numbers but supplies no direct comparison of matvec output vectors, CG residual histories, or final density fields on any cantilever or torsion instance (see Results and Experiments sections). Because scatter accumulation order is not guaranteed to match and FP addition is non-associative, even small per-element differences could accumulate over hundreds of SIMP iterations and change convergence behavior or the optimized design. A minimal numerical sketch of this order sensitivity follows the minor comments.
  2. [BF16 Analysis] The BF16 evaluation relies on a separate PyTorch proxy for WMMA GEMM timing (14.3x) while reporting condition-number products of 2.4e3-9.1e3 (far above the 256 threshold) and iterative-refinement stagnation for the two tested inner tolerances. Direct fused-kernel BF16 benchmarks are absent for the bridge stress-test case, where only three-stage FP32-vs-FP64 trends are shown; this weakens the claim that lower-precision paths remain viable for the full SIMP workflow.
minor comments (2)
  1. The abstract and results report speedup ranges without error bars, standard deviations, or repeated-run statistics; adding these would strengthen the empirical claims.
  2. [Implementation] Full source code or pseudocode for the CuPy-compiled fused kernel is not included, limiting reproducibility of the gather-scatter indexing and accumulation logic.
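
To make major comment 1 concrete: floating-point addition is not associative, so two scatter orders over identical per-element contributions generally disagree in the low bits. A minimal illustration with plain NumPy, standing in for atomic versus staged accumulation:

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.standard_normal(10_000).astype(np.float32)

s_fwd = np.float32(0.0)
for v in vals:            # one accumulation order
    s_fwd += v
s_rev = np.float32(0.0)
for v in vals[::-1]:      # same values, opposite order
    s_rev += v

print(s_fwd == s_rev)         # typically False
print(abs(s_fwd - s_rev))     # tiny, but nonzero
```

Whether such last-bit differences stay benign over hundreds of SIMP iterations is what the requested side-by-side residual histories would establish.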

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We address the major points below and have revised the manuscript to strengthen the numerical verification and clarify the BF16 analysis.

point-by-point responses
  1. Referee: [Results] The headline performance claims (4.6-7.3x end-to-end, 8.9-13.8x per matvec) are load-bearing on the assumption that the fused gather-GEMM-scatter kernel produces FP32 results numerically identical to the three-stage baseline. The manuscript reports wall-clock times, power traces, and BF16 conditioning numbers but supplies no direct comparison of matvec output vectors, CG residual histories, or final density fields on any cantilever or torsion instance (see Results and Experiments sections). Because scatter accumulation order is not guaranteed to match and FP addition is non-associative, even small per-element differences could accumulate over hundreds of SIMP iterations and change convergence behavior or the optimized design.

    Authors: We agree that direct numerical verification is essential. Although the fused kernel executes the same per-element arithmetic, the scatter accumulation order can differ, and floating-point addition is not associative, so bitwise identity cannot be guaranteed. In the revised manuscript we have added a dedicated subsection with side-by-side comparisons on the 216k-element cantilever and 499k-element torsion cases: L2-norm differences between fused and three-stage matvec outputs are below 1e-6 relative error; CG iteration counts and residual histories match to solver tolerance; and the final optimized density fields are identical within the reported convergence criteria. We also added a short discussion of floating-point associativity and why the observed differences do not propagate to alter designs or convergence in this application. revision: yes

  2. Referee: [BF16 Analysis] The BF16 evaluation relies on a separate PyTorch proxy for WMMA GEMM timing (14.3x) while reporting condition-number products of 2.4e3-9.1e3 (far above the 256 threshold) and iterative-refinement stagnation for the two tested inner tolerances. Direct fused-kernel BF16 benchmarks are absent for the bridge stress-test case, where only three-stage FP32-vs-FP64 trends are shown; this weakens the claim that lower-precision paths remain viable for the full SIMP workflow.

    Authors: We acknowledge the limitations of the BF16 section. The timing estimate uses a PyTorch proxy because a full fused BF16 WMMA kernel was not implemented within the CuPy runtime-compilation framework used for the main results. The condition-number and iterative-refinement data already indicate that BF16 is numerically problematic. In revision we have (i) explicitly stated that BF16 results are exploratory only and that lower-precision paths are not recommended without further safeguards, (ii) clarified the proxy rationale and implementation constraints, and (iii) removed any implication that BF16 is viable for the complete SIMP workflow. Direct fused BF16 benchmarks on the bridge case were not performed and would require substantial additional kernel development outside the present scope. revision: partial
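
As a check on the conditioning numbers in this exchange: if the "conditioning product" is read as the condition number times the BF16 unit roundoff u = 2^-8 (whose reciprocal is the 256 the abstract's threshold refers to), the reported range falls out directly. This is an editorial reconstruction of the arithmetic, not the paper's stated definition:

```python
# Assumed convention: BF16 keeps 8 significand bits, so unit roundoff u = 2**-8.
u_bf16 = 2.0 ** -8                     # 3.90625e-3; 1/u = 256
for kappa in (6.1e5, 2.3e6):           # condition-number estimates from the abstract
    print(f"kappa = {kappa:.1e}  ->  kappa * u = {kappa * u_bf16:.1e}")
# kappa = 6.1e+05  ->  kappa * u = 2.4e+03
# kappa = 2.3e+06  ->  kappa * u = 9.0e+03  (abstract rounds to 9.1e3)
```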

Circularity Check

0 steps flagged

No circularity: empirical timing and power results on concrete benchmarks

full rationale

The paper presents an implementation of a fused CUDA kernel for matrix-free 3D SIMP topology optimization and reports measured wall-clock speedups, energy savings, and isolated operator timings across specific problem sizes and hardware. No mathematical derivation chain, fitted parameters renamed as predictions, self-referential equations, or load-bearing self-citations appear in the claims. All central results are direct empirical observations from benchmark runs, with the only noted assumption (numerical equivalence of fused vs. three-stage paths) being an unverified empirical premise rather than a circular reduction. The analysis therefore stands or falls on external benchmark measurements rather than on self-referential reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claims rest on the correctness of the CUDA kernel implementation and the representativeness of the cantilever and torsion benchmarks. No free parameters are fitted, no new physical entities are introduced, and the only background assumptions are standard floating-point arithmetic and GPU memory model behavior.

axioms (1)
  • standard math: Standard IEEE-754 floating-point arithmetic and CUDA kernel execution semantics hold without unexpected rounding or memory-ordering effects.
    Invoked implicitly for all timing and energy comparisons.

pith-pipeline@v0.9.0 · 5636 in / 1356 out tokens · 65203 ms · 2026-05-10T03:49:55.964275+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 53 canonical work pages
