Mass matrix assembly for implicit PIC methods can be exactly reformulated cell-by-cell as tensor-core matrix products, delivering up to 3x kernel speedup and 15% end-to-end runtime reduction in ECSIM simulations.
Acceleration of tensor-product operations with tensor cores.ACM Transactions on Parallel Computing, 11(4):15:1–15:24
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CE 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.
citing papers explorer
-
Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods
Mass matrix assembly for implicit PIC methods can be exactly reformulated cell-by-cell as tensor-core matrix products, delivering up to 3x kernel speedup and 15% end-to-end runtime reduction in ECSIM simulations.
-
Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
A fused gather-GEMM-scatter CUDA kernel achieves 4.6-7.3x end-to-end speedup and 3.2-4.9x lower energy for matrix-free 3D SIMP topology optimization on RTX 4090 compared to three-stage baselines.