How far does a random forest generalize from a 54-run LAMMPS+SPICA benchmark?

Dennis Alves Pedersen; Fábio Andrijauskas; Paulo Henrique Leme Ramalho

arxiv: 2606.27695 · v1 · pith:ZYSB3AUCnew · submitted 2026-06-26 · 💻 cs.DC

Pith reviewed 2026-06-29 03:32 UTC · model grok-4.3

keywords random forest · performance prediction · LAMMPS · MPI OpenMP · HPC surrogate · generalization · molecular dynamics · hardware regimes

The pith

Random forest surrogate from 54 LAMMPS runs ranks hybrid configurations correctly only inside the same hardware regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether a random forest model trained on just 54 molecular dynamics runs can predict good hybrid MPI plus OpenMP settings for LAMMPS without running more jobs. The model learns from nine features about node counts, threads, and ratios to forecast loop time and internal timing breakdowns. Within the same regime, such as all single-node jobs or all multi-node jobs, it correctly orders which configurations run faster. Accuracy drops sharply when the test case crosses from one regime to another, such as from single-node to multi-node. This gives a practical map of where the cheap surrogate can be trusted to guide further tuning.

Core claim

Trained on 54 LAMMPS+SPICA runs spanning 18 hybrid configurations with three replications each, the random forest achieves 0.49 seconds mean absolute error on loop time in-sample, or 4 percent relative. Feature importance concentrates in topology variables like OpenMP thread count and MPI to OpenMP ratio, while raw node and core counts contribute less than 3 percent. Leave-one-dimension-out tests demonstrate that the model ranks configurations correctly when source and target stay inside one hardware regime (single-node, multi-node, or shared threading tier) but loses ranking power when they cross regime boundaries.
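To make the error figures concrete, the sketch below computes a mean absolute error and a relative error for a handful of loop times. It assumes the relative figure is the MAE divided by the mean measured loop time, which this summary does not state. The function names and sample numbers are invented for illustration and are not the paper's data.

```python
from statistics import mean

def mean_absolute_error(measured, predicted):
    """Average of |measured - predicted| over paired observations."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(m - p) for m, p in zip(measured, predicted))

def relative_mae(measured, predicted):
    """MAE divided by the mean measured value (an assumed definition)."""
    return mean_absolute_error(measured, predicted) / mean(measured)

# Hypothetical loop times in seconds, not taken from the paper.
measured = [10.0, 12.0, 14.0, 13.0]
predicted = [10.5, 11.5, 14.5, 12.5]

mae = mean_absolute_error(measured, predicted)
rel = relative_mae(measured, predicted)
print(f"MAE = {mae:.2f} s, relative = {rel:.1%}")

# Back-of-envelope check on the reported figures under the same assumption:
# 0.49 s at 4.0 % relative would correspond to this mean loop time.
implied_mean_loop_time = 0.49 / 0.040
```

Under that assumed definition, the reported 0.49 s and 4.0 % would correspond to a mean loop time of roughly 12 s.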

What carries the argument

The leave-one-dimension-out generalization procedure applied to the random forest regressor, which isolates hardware regime membership as the factor controlling prediction accuracy.
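One way to picture a leave-one-dimension-out split is as a leave-one-group-out loop over a regime label. The sketch below is an assumption about the procedure, not the authors' code, since this summary does not define the dimensions. The configuration records, the `regime` key, and the helper name are hypothetical.

```python
from collections import defaultdict

def leave_one_group_out(records, key):
    """Yield (held_out_label, train, test) with one group held out per fold."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    for label in sorted(groups):
        test = groups[label]
        train = [r for other, rs in groups.items() if other != label for r in rs]
        yield label, train, test

# Hypothetical configurations; 'regime' is an assumed grouping label.
records = [
    {"nodes": 1, "omp_threads": 1, "regime": "single-node"},
    {"nodes": 1, "omp_threads": 4, "regime": "single-node"},
    {"nodes": 2, "omp_threads": 4, "regime": "multi-node"},
    {"nodes": 4, "omp_threads": 8, "regime": "multi-node"},
    {"nodes": 8, "omp_threads": 8, "regime": "multi-node"},
]

folds = list(leave_one_group_out(records, "regime"))
for label, train, test in folds:
    print(f"held out {label}: train={len(train)} test={len(test)}")
```

Each fold here trains on one regime and tests on the other, which is the cross-boundary case. A within-regime check would instead hold out configurations that share a regime with the training set.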

If this is right

The surrogate can be used to recommend high-performing configurations inside a known regime without additional cluster time.
It produces an interpretable map showing where its recommendations remain reliable.
Benchmark campaigns can be scoped to fewer runs by trusting the surrogate inside each regime.
Overall allocation budget for tuning hybrid setups on similar clusters can be reduced substantially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defining clear hardware regimes upfront could let similar small surrogates guide tuning on other workloads or clusters.
Active learning that adds runs only at regime boundaries might extend the trusted region without much extra cost.
Cluster operators could pre-compute regime maps for common workloads to advise users on safe surrogate use.

Load-bearing premise

The 54 runs across 18 configurations are representative of performance inside each hardware regime on this cluster and workload.

What would settle it

A new set of runs inside one regime where the surrogate ranks the configurations in the wrong order would show the claim does not hold.
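A check of this kind needs a number for "ranked in the wrong order". One common choice, used here as an assumption rather than as the paper's metric, is Kendall's tau between measured and predicted loop times. The sketch implements the tie-free tau-a variant in plain Python; the loop times are hypothetical.

```python
from itertools import combinations

def kendall_tau(measured, predicted):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(measured)), 2):
        sign = (measured[i] - measured[j]) * (predicted[i] - predicted[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(measured) * (len(measured) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical loop times (s) for four configurations in one regime.
measured = [11.2, 12.8, 14.1, 16.0]
right_order = [11.0, 13.0, 13.9, 15.5]   # same ordering as measured
wrong_order = [15.0, 14.0, 12.0, 11.5]   # fully reversed ordering

tau_right = kendall_tau(measured, right_order)
tau_wrong = kendall_tau(measured, wrong_order)
```

A tau near 1 on fresh within-regime runs would support the claim; a tau near zero or below would count against it.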

Original abstract

Selecting near-optimal hybrid MPI+OpenMP configurations for molecular dynamics workloads on modern HPC clusters has traditionally required exhaustive empirical benchmarking, consuming allocation budget proportional to the number of configurations evaluated. This work investigates whether a cold-start Random Forest surrogate, trained once on a small, structured benchmark dataset, can reliably predict execution performance and recommend high-performing configurations without further cluster runs. The training dataset comprises 54 LAMMPS+SPICA runs of the antimicrobial peptide Tritrpticin on a hydrated DOPC bilayer (4 354 coarse-grained beads), spanning 18 hybrid configurations on 1-8 AMD EPYC 7662 nodes of the Lovelace cluster at CENAPAD-SP, with three independent replications each. Nine topology and resource features feed five regressors that predict loop time and four internal LAMMPS timing fractions (Pair, Kspace, Comm, Modify). In-sample mean absolute error is 0.49 s on loop time (4.0 % relative). Feature importance localizes predictive signal in topology variables (OpenMP threads and MPI/OpenMP ratio dominate; raw node and core counts contribute under 3 %). Leave-one-dimension-out generalization reveals that accuracy is governed by hardware regime membership: within a common regime (single-node, multi-node, or shared threading tier) the surrogate ranks configurations correctly, and degrades when targets cross architectural boundaries. The result is an interpretable map of where the surrogate's recommendations can be trusted, useful for scoping further benchmark campaigns at a fraction of their nominal cost.
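The abstract says nine topology and resource features feed the regressors, but this summary does not list them. As a rough illustration of how such features might be derived from a launch configuration, the sketch below computes a few plausible ones. The field names, the choice of features, and the example values are all assumptions, not the paper's feature set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HybridConfig:
    nodes: int            # number of compute nodes
    ranks_per_node: int   # MPI ranks launched on each node
    omp_threads: int      # OpenMP threads per MPI rank

def derive_features(cfg: HybridConfig) -> dict:
    """Return hypothetical topology/resource features for one configuration."""
    total_ranks = cfg.nodes * cfg.ranks_per_node
    return {
        "nodes": cfg.nodes,
        "omp_threads": cfg.omp_threads,
        "total_mpi_ranks": total_ranks,
        # Assumes one core per OpenMP thread (no oversubscription).
        "total_cores": total_ranks * cfg.omp_threads,
        "mpi_omp_ratio": total_ranks / cfg.omp_threads,
        "is_multi_node": int(cfg.nodes > 1),
    }

# Example values chosen for illustration only.
cfg = HybridConfig(nodes=2, ranks_per_node=16, omp_threads=4)
features = derive_features(cfg)

# The dataset size follows from the design stated in the abstract.
n_runs = 18 * 3  # 18 configurations x 3 replications
```

Ratio-style features such as `mpi_omp_ratio` describe how the cores are split between ranks and threads instead of how many there are, which is the kind of topology variable the abstract reports as dominant.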

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RF surrogate ranks configs inside hardware regimes on this 54-run LAMMPS set but the 18-config sample is too thin to make the ranking claim robust.

The main finding is that a random forest trained on 54 LAMMPS+SPICA runs can rank hybrid MPI+OpenMP setups correctly when the target stays inside the same regime (single-node, multi-node, or shared threading) on this AMD EPYC cluster, but accuracy drops when crossing boundaries. The authors collected the data themselves (18 configurations with three replications each on the Lovelace cluster, for one peptide-bilayer system) and used nine topology and resource features to predict loop time plus four internal timers. In-sample MAE comes in at 0.49 s (4 % relative), feature importance puts the MPI/OpenMP ratio and thread count at the top, and the leave-one-dimension-out test produces the regime map.

The new piece is the concrete benchmark set plus the explicit check on where the surrogate stops working. That map is practical for anyone who wants to cut down full-cluster runs when tuning similar workloads.

The soft spot is the data volume. Eighteen configurations in total means each regime gets only a handful of points, and three replications per configuration leave little room to check variance or rule out sampling artifacts. The abstract gives no per-regime counts and no statistical test against a baseline, so the ranking result could be fragile. Scope is also narrow (one workload, one cluster), so the regime boundaries are not shown to transfer.
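The variance concern can be made concrete with a small calculation. The sketch below summarizes three replications per configuration and flags a pair whose mean gap is smaller than the replication spread. The loop times and the simple threshold rule are assumptions for illustration, not the paper's data or method.

```python
from statistics import mean, stdev

def replication_summary(times):
    """Mean, sample standard deviation, and coefficient of variation."""
    if len(times) < 2:
        raise ValueError("need at least two replications")
    m = mean(times)
    s = stdev(times)          # sample standard deviation (n - 1)
    return {"mean": m, "stdev": s, "cv": s / m}

# Hypothetical loop times (s) for three replications of two configurations.
config_a = [12.0, 12.5, 13.0]
config_b = [12.4, 12.9, 13.4]

summary_a = replication_summary(config_a)
summary_b = replication_summary(config_b)

# If the gap between means is smaller than the replication spread,
# the measured ordering of the two configurations is itself uncertain.
gap = abs(summary_a["mean"] - summary_b["mean"])
ordering_is_uncertain = gap < max(summary_a["stdev"], summary_b["stdev"])
```

With only three replications the standard deviation is itself a noisy estimate, so a flag like this is a prompt for more runs, not a formal test.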

This is for people who tune LAMMPS on comparable hardware or who build cheap surrogates for HPC performance. The empirical data and the regime observation are worth a referee's time even if the generalization needs tighter validation on sample size.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that a Random Forest surrogate trained on a 54-run LAMMPS+SPICA benchmark (18 hybrid MPI+OpenMP configurations on 1-8 nodes, three replications each) achieves an in-sample MAE of 0.49 s (4 % relative) for loop time and that leave-one-dimension-out validation shows generalization is governed by hardware regime membership: the model ranks configurations correctly within single-node, multi-node, or shared-threading regimes but degrades across boundaries, yielding an interpretable trust map for surrogate use.

Significance. If the regime-dependent generalization result holds, the work supplies a practical, low-cost method for scoping exhaustive benchmark campaigns on HPC clusters for molecular-dynamics workloads. The empirical, non-circular training on independent runs and the localization of predictive signal to topology variables (OpenMP threads and MPI/OpenMP ratio) are explicit strengths that increase the result's utility.

major comments (2)

[Abstract] Abstract (leave-one-dimension-out generalization paragraph): the central claim that the surrogate 'ranks configurations correctly' inside each regime rests on the assumption that the 18 configurations already provide representative coverage of the performance surface within each regime; the manuscript reports neither per-regime configuration counts nor replication variances nor a statistical test against a baseline ranking, leaving open the possibility that observed within-regime accuracy is an artifact of sparse sampling.
[Methods] Methods (data-split and validation subsection): the exact definition of the 'dimensions' in leave-one-dimension-out, the number of folds per regime, and the hyper-parameter settings of the five regressors are not stated; these details are load-bearing for reproducing and confirming that regime membership, rather than other factors, governs the reported accuracy drop.
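The first major comment asks for a statistical test against a baseline ranking. A minimal baseline is the chance that a uniformly random ordering happens to match the true one, which is 1/n! for n configurations without ties. The sketch below tabulates that chance for a few regime sizes. The sizes are illustrative, since no per-regime counts are reported, and the helper names are hypothetical.

```python
from itertools import permutations
from math import factorial

def chance_of_perfect_ranking(n_configs):
    """Probability that a uniformly random ordering matches the true one."""
    if n_configs < 1:
        raise ValueError("need at least one configuration")
    return 1 / factorial(n_configs)

def count_perfect_by_enumeration(n_configs):
    """Brute-force check: count permutations equal to the identity order."""
    truth = tuple(range(n_configs))
    return sum(1 for p in permutations(truth) if p == truth)

# Regime sizes here are illustrative; the summary gives no per-regime counts.
chances = {n: chance_of_perfect_ranking(n) for n in (2, 3, 4, 6)}
for n, p in chances.items():
    print(f"{n} configurations: chance of a perfect random ranking = {p:.4f}")
```

With two or three configurations in a regime, a perfect ranking is weak evidence on its own; with six it would be much harder to reach by chance.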

minor comments (2)

[Abstract] The abstract states 'five regressors' but does not clarify whether these are independent Random Forests or a single multi-output model; adding a sentence on this would improve clarity without affecting the central claim.
[Results] A table or figure presenting the 18 configurations and their replication times would allow readers to verify the per-regime sample sizes directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional clarity will improve the manuscript. We address each major comment below and will revise accordingly to include the requested details on validation and statistics.

Referee: [Abstract] Abstract (leave-one-dimension-out generalization paragraph): the central claim that the surrogate 'ranks configurations correctly' inside each regime rests on the assumption that the 18 configurations already provide representative coverage of the performance surface within each regime; the manuscript reports neither per-regime configuration counts nor replication variances nor a statistical test against a baseline ranking, leaving open the possibility that observed within-regime accuracy is an artifact of sparse sampling.

Authors: We agree that the abstract does not explicitly report per-regime configuration counts, replication variances, or a baseline comparison. The manuscript states there are 18 configurations with three replications but does not break them down by regime in the abstract. In revision we will add these details (per-regime counts, standard deviations from replications, and a short note comparing ranking performance to a mean baseline) to the abstract and main text, confirming that the within-regime ranking holds under the reported sampling.
revision: yes

Referee: [Methods] Methods (data-split and validation subsection): the exact definition of the 'dimensions' in leave-one-dimension-out, the number of folds per regime, and the hyper-parameter settings of the five regressors are not stated; these details are load-bearing for reproducing and confirming that regime membership, rather than other factors, governs the reported accuracy drop.

Authors: We acknowledge these details are missing from the Methods section. The dimensions correspond to the three hardware regimes (single-node, multi-node, shared-threading); leave-one-dimension-out uses three folds, each omitting one regime. Hyperparameters for the five regressors follow scikit-learn defaults, with the Random Forest using 100 estimators and other settings as standard. We will add a dedicated paragraph in the revised Methods section with these definitions, fold counts, and hyperparameter values.
revision: yes

Circularity Check

0 steps flagged

No circularity; empirical ML surrogate on independent benchmark data.

The paper trains a standard Random Forest on 54 independent empirical LAMMPS+SPICA runs (18 configurations × 3 replications) and evaluates generalization via leave-one-dimension-out. No equations, ansatzes, or self-citations reduce the reported performance rankings or errors to quantities defined by the same fitted parameters. Feature importances and regime-based accuracy claims are direct outputs of the trained model on held-out data, with no self-definitional loops or fitted-inputs-called-predictions. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of the random forest on the 54-run dataset and the interpretation of leave-one-dimension-out results as evidence of regime-dependent generalization. No free parameters or invented entities are explicitly introduced; beyond standard supervised learning assumptions, the only axiom is the domain assumption listed under the axioms heading.

axioms (1)

Domain assumption: the nine topology and resource features are sufficient to capture the dominant sources of performance variation within each hardware regime.
These features are used to train the five regressors; the claim that they localize predictive signal is central to the feature-importance analysis.
