A Matrix-Free Galerkin Multigrid Solver and Failure-Mode Screen for Single-GPU 3D SIMP Linear Systems
Pith reviewed 2026-05-07 12:08 UTC · model grok-4.3
The pith
A matrix-free Galerkin GMG solver for single-GPU 3D SIMP elasticity achieves 1.62x-3.12x speedups over a capped Jacobi-PCG baseline; heterogeneous pass rates fall from 7/9 to 1/9 as meshes grow from 64k to 512k elements, and a spectral failure screen is offered as a diagnostic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a 27-case heterogeneous cantilever sweep, pass rates under a 200-iteration budget are 7/9, 4/9, and 1/9 at 64k, 216k, and 512k elements; FP32-GMG gives 1.62x, 1.75x, and 3.12x wall-time ratios relative to the capped flat Jacobi-PCG baseline; the largest reported solve is a 1M-element uniform-modulus system solved in 1.50+/-0.58 s.
Load-bearing premise
That the fused matrix-free fine operator combined with local Galerkin aggregation and sparse coarse products yields a stable hierarchy whose convergence behavior remains predictable for the high-contrast, density-dependent operators typical of heterogeneous 3D SIMP problems.
Original abstract
Large 3D SIMP studies require repeated elasticity solves for density-dependent operators whose finest matrices are expensive to assemble and whose conditioning degrades under high contrast. We study this linear-solver layer rather than claiming end-to-end optimization acceleration. The solver builds a matrix-free Galerkin geometric multigrid (GMG) hierarchy around a fused fine operator: the finest level remains matrix-free, the first coarse level is assembled by local Galerkin aggregation, and deeper levels use sparse Galerkin products. The practical default is FP32-GMG; BF16 is evaluated as a guarded mixed-precision variant and diagnostic stress test, not as the main speed mechanism. In a 27-case heterogeneous cantilever sweep, pass rates under a 200-iteration budget are 7/9, 4/9, and 1/9 at 64k, 216k, and 512k elements; converged-only mean iteration counts are about 112, 134, and 146. On uniform rho=0.5, p=3 solves, FP32-GMG gives 1.62x, 1.75x, and 3.12x wall-time ratios relative to the capped flat Jacobi-PCG baseline at the same sizes; that non-converged baseline reaches the 200-iteration cap in all timed trials. BF16-GMG is not faster than FP32-GMG. In 18 fixed-seed heterogeneous BF16 validation cases, 7/18 converge, matching the FP64 count, and 11 cases that pass the spectral screen still fail the 500-iteration cap; the screen is therefore diagnostic rather than a convergence certificate. The largest reported solve is a 1M-element uniform-modulus system solved in 1.50+/-0.58 s with an 8.66 GiB hierarchy-allocation delta during setup, not a peak-memory trace; this point is reported as uniform scaling, not heterogeneous robustness evidence. The contribution is therefore a bounded single-GPU solver result built on an inherited Level 0 matrix-free operator: a Galerkin GMG hierarchy, direct BF16 guard evidence, and an explicit failure-mode screen for structured 3D SIMP linear systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a matrix-free Galerkin geometric multigrid (GMG) solver for the large, density-dependent linear elasticity systems that arise in 3D SIMP topology optimization on a single GPU. The hierarchy keeps the finest level matrix-free via a fused operator, assembles the first coarse level by local Galerkin aggregation, and uses sparse Galerkin products on deeper levels. FP32-GMG is the practical default; BF16-GMG is evaluated only as a guarded mixed-precision diagnostic and stress test. Empirical results are reported on uniform-modulus and heterogeneous cantilever problems, including pass rates under iteration budgets, mean iteration counts for converged cases, wall-time ratios versus a capped Jacobi-PCG baseline, and a spectral failure-mode screen. The largest reported solve is a 1M-element uniform system solved in 1.50+/-0.58 s.
Significance. If the bounded performance and diagnostic screen hold, the work supplies a practical, memory-efficient single-GPU solver layer that could support larger-scale 3D SIMP studies by reducing the cost of repeated elasticity solves. Concrete strengths include the direct reporting of pass rates (7/9, 4/9, 1/9 across mesh sizes in the 27-case heterogeneous sweep), converged-only iteration counts (approximately 112-146), and timing ratios (1.62x-3.12x on uniform problems), together with the explicit failure-mode screen and matrix-free design that avoids full fine-level assembly.
Major comments (2)
- Heterogeneous cantilever sweep results: pass rates under the 200-iteration budget fall from 7/9 at 64k elements to 1/9 at 512k elements, while converged-only mean iterations rise from ~112 to 146. This degradation directly tests the central assumption that the fused matrix-free fine operator plus local Galerkin aggregation produces a stable hierarchy whose convergence remains predictable for high-contrast, density-dependent 3D SIMP operators; the observed trend indicates that robustness is not yet demonstrated for the target regime and requires either mitigation or clearer bounding of the method's applicability.
- Baseline and validation comparisons: speedups are reported only relative to a flat Jacobi-PCG baseline that reaches the 200-iteration cap in all timed trials, and BF16 validation shows that 11 of 18 cases passing the spectral screen still fail a 500-iteration cap. Additional comparisons to established GPU multigrid or algebraic-multigrid solvers, together with a quantitative error analysis (e.g., residual histories or eigenvalue bounds), are needed to substantiate the performance and diagnostic claims.
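The capped flat Jacobi-PCG baseline that the speedups are measured against — and the residual-norm histories this comment requests — can both be illustrated with a short sketch. This is plain numpy on a toy dense system of our own construction, not the paper's GPU implementation:

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, max_iter=200):
    """Flat Jacobi-preconditioned CG with an iteration cap and a recorded
    residual-norm history (the kind of baseline and diagnostic record the
    review discusses; illustrative only)."""
    d_inv = 1.0 / np.diag(A)          # Jacobi preconditioner M^{-1} = D^{-1}
    x = np.zeros_like(b)
    r = b - A @ x
    z = d_inv * r
    p = z.copy()
    rz = r @ z
    history = [np.linalg.norm(r)]     # residual-norm history, per iteration
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        history.append(np.linalg.norm(r))
        if history[-1] <= tol * history[0]:
            return x, history, True   # converged within the cap
        z = d_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, history, False          # hit the cap, like the timed baseline
```

The returned flag makes the paper's reporting convention explicit: a timed baseline run that returns `False` has reached the iteration cap without converging, so wall-time ratios against it measure time-to-cap, not time-to-solution.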
Minor comments (2)
- The memory figure is given as an 8.66 GiB hierarchy-allocation delta during setup rather than a peak-memory trace; clarifying this distinction and reporting peak usage would improve reproducibility and practical assessment.
- Implementation specifics of the fused fine operator and the local aggregation procedure are only sketched; expanding these descriptions (e.g., with pseudocode or explicit stencil operations) would aid readers attempting to reproduce or extend the solver.
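The delta-versus-peak distinction raised in the first minor comment matters because setup code can allocate large transient workspaces that a before/after allocation delta never sees. A minimal Python sketch of the distinction, using the standard-library `tracemalloc` tracer (the sizes and structure are invented for illustration):

```python
import tracemalloc

def setup_hierarchy():
    """Toy setup that allocates a large temporary scratch buffer and keeps
    a smaller persistent structure -- mimicking why a hierarchy-allocation
    *delta* can understate *peak* usage during setup."""
    scratch = bytearray(8_000_000)                      # transient workspace
    levels = [bytearray(1_000_000) for _ in range(3)]   # persistent hierarchy
    del scratch                                         # freed before return
    return levels

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
hierarchy = setup_hierarchy()
after, peak = tracemalloc.get_traced_memory()
delta = after - before   # what an allocation delta reports (~3 MB here)
# `peak` additionally includes the transient scratch buffer the delta misses
tracemalloc.stop()
```

On a GPU the same distinction applies between a before/after allocated-bytes difference and a peak-memory trace, which is why the referee asks for the latter.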
Simulated Author's Rebuttal
We thank the referee for the constructive review and for acknowledging the matrix-free design, direct pass-rate reporting, and the explicit failure-mode screen. We address each major comment point by point below, with honest proposals for revision where feasible.
Point-by-point responses
Referee: Heterogeneous cantilever sweep results: pass rates under the 200-iteration budget fall from 7/9 at 64k elements to 1/9 at 512k elements, while converged-only mean iterations rise from ~112 to 146. This degradation directly tests the central assumption that the fused matrix-free fine operator plus local Galerkin aggregation produces a stable hierarchy whose convergence remains predictable for high-contrast, density-dependent 3D SIMP operators; the observed trend indicates that robustness is not yet demonstrated for the target regime and requires either mitigation or clearer bounding of the method's applicability.
Authors: We agree that the reported degradation in pass rates (7/9 to 1/9) and the modest rise in converged iteration counts (~112 to 146) with mesh size in the heterogeneous sweep demonstrates limits to robustness for high-contrast cases. The manuscript already positions the result as bounded rather than universally robust, and the spectral screen is presented as a diagnostic (not a convergence guarantee), as shown by the BF16 validation where 11 screen-passing cases fail the 500-iteration cap. As a partial revision we will expand the discussion and conclusions to explicitly bound applicability to screen-passing regimes and to frame the observed trend as a limitation of the current hierarchy for larger heterogeneous problems. revision: partial
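For concreteness, one plausible form such a spectral screen could take — emphasized as a hypothetical reconstruction, since the paper's screen is not specified in this review — is a power-iteration estimate of the spectral radius of an iteration's error-propagation operator, with "pass" meaning the estimate falls below 1:

```python
import numpy as np

def spectral_screen(apply_error_op, n, tol_rho=1.0, iters=50, seed=0):
    """Power-iteration estimate of the spectral radius of an error-propagation
    operator E (e.g. E = I - omega * D^{-1} A for a damped-Jacobi smoother).
    A hypothetical stand-in for the paper's spectral screen. As the authors
    stress, passing is diagnostic, not a convergence certificate: power
    iteration can underestimate rho, and rho < 1 says nothing about meeting
    a finite iteration budget."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    rho = 0.0
    for _ in range(iters):
        w = apply_error_op(v)
        rho = np.linalg.norm(w)   # norm ratio -> dominant |eigenvalue|
        if rho == 0.0:
            return 0.0, True
        v = w / rho
    return rho, rho < tol_rho
```

A smoother with rho just below 1 passes the screen yet can still exhaust a 200- or 500-iteration cap, which is exactly the gap between "diagnostic" and "certificate" that the BF16 validation exposes.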
Referee: Baseline and validation comparisons: speedups are reported only relative to a flat Jacobi-PCG baseline that reaches the 200-iteration cap in all timed trials, and BF16 validation shows that 11 of 18 cases passing the spectral screen still fail a 500-iteration cap. Additional comparisons to established GPU multigrid or algebraic-multigrid solvers, together with a quantitative error analysis (e.g., residual histories or eigenvalue bounds), are needed to substantiate the performance and diagnostic claims.
Authors: The flat Jacobi-PCG baseline was selected as a standard, assembly-free single-GPU reference that consistently hits the 200-iteration cap, making the 1.62x-3.12x wall-time ratios a direct measure of practical advantage. The BF16 results are used specifically to establish that the spectral screen is diagnostic rather than predictive. We will add residual-norm histories for representative cases as quantitative error analysis in a partial revision. However, we cannot provide comparisons to other established GPU multigrid or AMG solvers without substantial new implementation and experiments. revision: partial
- Unaddressed in the proposed revision: additional comparisons to established GPU multigrid or algebraic-multigrid solvers
Circularity Check
No significant circularity; all claims rest on direct empirical measurements.
Full rationale
The paper reports wall-time ratios, iteration counts, and pass/fail rates from explicit runs of the described FP32-GMG and BF16-GMG solvers on fixed cantilever meshes (64k–1M elements) and uniform-modulus cases. These quantities are measured outcomes, not quantities derived from equations that are then re-used as inputs. The text explicitly frames the contribution as 'a bounded single-GPU solver result built on an inherited Level 0 matrix-free operator' together with 'direct BF16 guard evidence' and 'an explicit failure-mode screen'; no first-principles derivation, uniqueness theorem, or fitted-parameter prediction is invoked whose validity would collapse back to the same data or self-citation. The observed degradation in heterogeneous pass rates is presented as a diagnostic finding rather than a theoretically guaranteed property, so the empirical record itself serves as the evidence.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Galerkin coarse-grid operators provide sufficiently accurate corrections for the fine-grid problem in the multigrid hierarchy.
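This axiom rests on the variational property of the Galerkin product A_c = P^T A P: for symmetric positive definite A, the coarse correction is the A-norm-optimal update in range(P), equivalently the corrected residual is orthogonal to range(P). A small numpy check (random SPD A and prolongation P of our own choosing) illustrates the property:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD stand-in for a fine operator
P = rng.standard_normal((n, m))      # prolongation: any full-rank coarse basis

x_true = rng.standard_normal(n)
b = A @ x_true
x = np.zeros(n)                      # current iterate
r = b - A @ x

A_c = P.T @ A @ P                    # Galerkin coarse operator
e_c = np.linalg.solve(A_c, P.T @ r)  # coarse-level solve
x_new = x + P @ e_c                  # coarse correction

# Variational (Galerkin) property: the corrected residual is orthogonal
# to range(P), so the correction is A-norm optimal within the coarse space.
r_new = b - A @ x_new
print(np.linalg.norm(P.T @ r_new))   # ~ 0 up to round-off
```

The axiom's practical content is that range(P) captures enough of the error for this optimal-in-subspace correction to be a good correction overall, which is exactly what the heterogeneous high-contrast cases stress.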