
arxiv: 2605.19939 · v2 · pith:GDPWG2AOnew · submitted 2026-05-19 · 💻 cs.CE

Uncertainty-aware Machine Learning Interatomic Potentials via Learned Functional Perturbations

Pith reviewed 2026-05-20 04:22 UTC · model grok-4.3

classification 💻 cs.CE
keywords uncertainty quantification · machine learning interatomic potentials · equivariant graph neural networks · continuous ranked probability score · functional perturbations · out-of-distribution prediction · silica · N-body benchmark

The pith

Machine learning interatomic potentials gain reliable uncertainty estimates by adding learned functional perturbations and training with the continuous ranked probability score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make existing machine learning interatomic potentials uncertainty-aware in a simple way that avoids ensembles or extra hyperparameters. It introduces learned functional perturbations to the deterministic output and finetunes the model end-to-end using the continuous ranked probability score as the training objective. This matters for simulations and active learning because it lets users know when predictions are likely to fail on new atomic arrangements. Tests on charged-particle systems and silica structures show the approach produces uncertainties that align better with actual errors than prior Bayesian methods.

Core claim

A deterministic MLIP becomes probabilistic when its predictions receive learned functional perturbations that are optimized jointly with the continuous ranked probability score, producing calibrated uncertainty estimates that improve correlation with true errors on out-of-distribution configurations.

What carries the argument

Learned functional perturbations, which modify the model's output function during end-to-end CRPS training to encode predictive uncertainty.
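To make the training objective concrete, here is a minimal sketch of the standard sample-based CRPS estimate for one scalar target (for example, a single force component). It assumes the perturbed model can be evaluated several times to yield an ensemble of predictions; how those samples are produced is not specified in the text above, so the function name `crps_ensemble` and its inputs are illustrative rather than the paper's implementation.

```python
from typing import Sequence


def crps_ensemble(samples: Sequence[float], target: float) -> float:
    """Sample-based CRPS estimate: mean|x_i - y| - 0.5 * mean|x_i - x_j|.

    `samples` are hypothetical draws from the perturbed model for one
    scalar quantity; `target` is the reference value. Lower is better.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("need at least one sample")
    # Accuracy term: mean absolute distance between samples and the target.
    accuracy = sum(abs(x - target) for x in samples) / m
    # Spread term: mean absolute distance over all ordered sample pairs.
    spread = sum(abs(a - b) for a in samples for b in samples) / (m * m)
    return accuracy - 0.5 * spread
```

With a single sample the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of MAE. The pairwise term costs O(m²) per target, which is usually acceptable for the small ensembles used during training.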

If this is right

  • Active learning for MLIPs can select new training structures more efficiently by using the uncertainty signal.
  • Molecular dynamics simulations become safer because high-uncertainty regions can trigger fallback to more expensive calculations.
  • Foundation models for materials can be turned uncertainty-aware through the same finetuning procedure without redesign.
  • The method applies equally to models trained from scratch and to large pretrained potentials.
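The first two consequences can be sketched in a few lines. The example below assumes each candidate structure comes with a small ensemble of sampled predictions and uses the ensemble standard deviation as the uncertainty signal; the helper names, the dictionary layout, and the threshold are hypothetical choices for illustration, not details from the paper.

```python
from statistics import pstdev
from typing import Dict, List, Sequence


def select_for_labeling(
    ensemble_by_id: Dict[str, Sequence[float]], budget: int
) -> List[str]:
    """Return the ids of the `budget` most uncertain candidates.

    Uncertainty is taken to be the population standard deviation of the
    sampled predictions (an assumption made for this sketch).
    """
    ranked = sorted(
        ensemble_by_id,
        key=lambda key: pstdev(ensemble_by_id[key]),
        reverse=True,
    )
    return ranked[:budget]


def needs_fallback(samples: Sequence[float], threshold: float) -> bool:
    """Flag a configuration whose predictive spread exceeds `threshold`,
    e.g. to hand an MD step over to a more expensive reference calculation."""
    return pstdev(samples) > threshold


pool = {
    "a": [1.0, 1.0, 1.0],   # no spread
    "b": [0.0, 2.0, 4.0],   # large spread
    "c": [1.0, 1.2, 0.8],   # small spread
}
chosen = select_for_labeling(pool, budget=2)
```

In practice the threshold would need to be calibrated against held-out errors, which is where the quality of the uncertainty ranking matters.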

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perturbation idea may extend to other scientific machine-learning models that currently lack built-in uncertainty.
  • Combining these perturbations with selective ensemble averaging could further tighten calibration on rare events.
  • The approach might reduce the data needed for reliable potentials by guiding data collection toward uncertain regions.

Load-bearing premise

Learned functional perturbations, when optimized with CRPS, can represent the uncertainty of atomic configurations that lie outside the training distribution.

What would settle it

A new test set of atomic configurations with errors that do not increase in line with the model's reported uncertainty, or CRPS scores that fail to beat the Bayesian baseline.
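One way to run the first half of this check is a simple binned comparison: sort test points by predicted uncertainty, split them into equal-count bins, and see whether the mean observed error rises from bin to bin. The sketch below is an assumed diagnostic, not a procedure taken from the paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def binned_mean_errors(
    uncertainties: Sequence[float], errors: Sequence[float], n_bins: int
) -> List[float]:
    """Mean observed error per bin, with bins ordered by rising uncertainty."""
    if len(uncertainties) != len(errors):
        raise ValueError("inputs must have the same length")
    n = len(errors)
    if not 1 <= n_bins <= n:
        raise ValueError("n_bins must be between 1 and the number of points")
    # Indices of the test points, sorted from least to most uncertain.
    order = sorted(range(n), key=lambda i: uncertainties[i])
    means = []
    for b in range(n_bins):
        # Near-equal-count slices; each is non-empty because n_bins <= n.
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(sum(errors[i] for i in idx) / len(idx))
    return means


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True if each bin's mean error is at least the previous bin's."""
    return all(a <= b for a, b in zip(values, values[1:]))
```

A sequence of bin means that is flat or falling on a new test set would be the kind of evidence described above. With few points per bin the means are noisy, so a strict monotonicity test is only a rough screen.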

Figures

Figures reproduced from arXiv: 2605.19939 by Dario Coscia, David R. Wessels, Erik J. Bekkers, Maksim Zhdanov, Olga Zaghen.

Figure 1
Figure 1: N-body test performance vs. training size (mean ± std over 4 seeds, shaded). Left: MSE; Center: CRPS; Right: spread-to-skill ratio (SSR). P-EGNN consistently achieves the best CRPS and the SSR closest to 1 at every training size, with the calibration gap widening as n grows.
[Plot omitted. Recoverable axis labels: training set size n (ticks at 32, 128, 1024); F-MAE [meV/Å] ↓; F-CRPS [meV/Å] ↓; F-Spear ↑ (ide…]
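For readers unfamiliar with the spread-to-skill ratio in the right panel, the sketch below shows one common definition: the root of the mean ensemble variance divided by the RMSE of the ensemble mean. The paper's exact convention (for example, any finite-ensemble correction factor) is not given here, so this form and the name `spread_skill_ratio` are assumptions.

```python
from math import sqrt
from statistics import fmean, variance
from typing import Sequence


def spread_skill_ratio(
    ensembles: Sequence[Sequence[float]], targets: Sequence[float]
) -> float:
    """SSR = sqrt(mean ensemble variance) / RMSE of the ensemble mean.

    Each entry of `ensembles` holds at least two sampled predictions for
    one test case. Values near 1 suggest the spread matches the error;
    below 1 suggests overconfidence, above 1 underconfidence.
    """
    if len(ensembles) != len(targets) or not targets:
        raise ValueError("need one ensemble per target and at least one case")
    # Spread: sample variance of each ensemble, averaged over test cases.
    mean_var = fmean([variance(samples) for samples in ensembles])
    # Skill: squared error of each ensemble mean, averaged over test cases.
    mse = fmean([(fmean(s) - y) ** 2 for s, y in zip(ensembles, targets)])
    if mse == 0:
        raise ValueError("SSR is undefined when the ensemble mean is exact")
    return sqrt(mean_var / mse)
```

Because both numerator and denominator are aggregated before taking the ratio, the result describes calibration on average over the test set rather than per configuration.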
Original abstract

Machine Learning Interatomic Potentials (MLIPs) achieve near ab initio accuracy at a fraction of the cost of quantum-mechanical simulations, yet they remain prone to silent failures on out-of-distribution configurations, making principled uncertainty quantification (UQ) essential for error-aware simulations and active learning. Existing non-ensemble UQ methods for MLIPs rely either on variational inference or on parametric distributional assumptions, both of which add architectural complexity and hyperparameters that must be tuned per task. Inspired by recent advances in probabilistic weather forecasting, we propose a simpler alternative: turn a deterministic MLIP into a probabilistic one through learned functional perturbations and finetune it end-to-end with the Continuous Ranked Probability Score (CRPS), a proper scoring rule. We validate the approach with an equivariant GNN (P-EGNN) trained from scratch and by finetuning the foundation model Orb-v3 for silica. On the N-body charged particle benchmark, P-EGNN improves CRPS over the state-of-the-art Bayesian MLIP method BLIP by 19-32% across all training sizes; on silica, P-Orb raises the Spearman correlation between predicted uncertainty and actual error from 0.75 (BLIP-Orb) to 0.84.
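The silica result is stated as a Spearman correlation between predicted uncertainty and actual error. As a reminder of what that number measures, here is a small dependency-free sketch: rank both quantities (averaging ranks over ties) and take the Pearson correlation of the ranks. The function names are illustrative; libraries such as SciPy provide an equivalent routine.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length inputs with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

Because only ranks enter, the score rewards correct ordering of errors by uncertainty and says nothing about whether the uncertainty magnitudes are calibrated; that is what CRPS and the spread-to-skill ratio address.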

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes turning deterministic ML interatomic potentials into probabilistic models by introducing learned functional perturbations and end-to-end finetuning with the Continuous Ranked Probability Score (CRPS). It reports that P-EGNN improves CRPS by 19-32% over BLIP on the N-body charged-particle benchmark across training sizes, and that P-Orb raises the Spearman correlation between predicted uncertainty and error from 0.75 (BLIP-Orb) to 0.84 on silica.

Significance. If the central results hold under full verification, the approach supplies a lower-complexity alternative to variational or ensemble UQ for MLIPs, which could simplify reliable active learning and error-aware molecular dynamics. The concrete benchmark gains (CRPS and Spearman lifts) constitute a clear, falsifiable advance worth testing on additional OOD regimes.

major comments (1)
  1. The claim that learned functional perturbations, when finetuned with CRPS, adequately represent epistemic uncertainty on out-of-distribution atomic configurations rests on the N-body and silica results; however, the manuscript provides insufficient detail on data splits, OOD construction, and verification that the observed gains (19-32% CRPS, 0.75 to 0.84 Spearman) arise from epistemic rather than in-distribution calibration improvements.
minor comments (1)
  1. The abstract and methods description omit explicit statements of the perturbation parameterization, the precise form of the CRPS loss, and the training protocol for P-Orb finetuning; adding these would strengthen reproducibility.
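
For reference while the paper's own formulation is unavailable, the textbook definition of CRPS for a predictive CDF F with finite first moment and an observed value y is given below, together with the usual M-sample estimator. This is the standard form, not necessarily the exact loss used in the manuscript; the symbols F, y, X, X', x_i, and M are notation chosen for this note.

```latex
\mathrm{CRPS}(F, y)
  = \int_{-\infty}^{\infty} \bigl( F(z) - \mathbf{1}\{ z \ge y \} \bigr)^{2} \, dz
  = \mathbb{E}_{X \sim F} \lvert X - y \rvert
    - \tfrac{1}{2} \, \mathbb{E}_{X, X' \sim F} \lvert X - X' \rvert ,
\qquad
\widehat{\mathrm{CRPS}}(x_{1:M}, y)
  = \frac{1}{M} \sum_{i=1}^{M} \lvert x_i - y \rvert
    - \frac{1}{2 M^{2}} \sum_{i=1}^{M} \sum_{j=1}^{M} \lvert x_i - x_j \rvert .
```

Here X and X' are independent draws from F. The first term rewards accuracy and the second rewards spread, so a model cannot lower the score simply by collapsing or inflating its predictive distribution.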

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential impact. We address the major comment below and will incorporate clarifications and additional details in the revised manuscript.

Point-by-point responses
  1. Referee: The claim that learned functional perturbations, when finetuned with CRPS, adequately represent epistemic uncertainty on out-of-distribution atomic configurations rests on the N-body and silica results; however, the manuscript provides insufficient detail on data splits, OOD construction, and verification that the observed gains (19-32% CRPS, 0.75 to 0.84 Spearman) arise from epistemic rather than in-distribution calibration improvements.

    Authors: We agree that greater explicitness on data splits and OOD construction will strengthen the presentation. In the revised manuscript we will add a dedicated paragraph in the Experiments section that specifies: (i) for the N-body benchmark, training configurations are generated with 5–10 particles while test sets include systems with 15–20 particles to induce controlled distributional shift; (ii) for the silica benchmark, the OOD subset is constructed from trajectories at temperatures and defect densities outside the training distribution. On the epistemic-versus-calibration question, the functional perturbations are introduced precisely to allow the model to express epistemic variability in the learned potential; CRPS training then optimizes the entire predictive distribution under this variability. The reported CRPS gains are measured on the shifted test distributions, and the Spearman improvement quantifies better ranking of actual errors by the predicted uncertainty, which is precisely the behavior expected when epistemic uncertainty is better captured. We will include a short discussion of this distinction together with a supplementary plot of uncertainty–error correlation stratified by in-distribution versus OOD subsets.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; new perturbations and CRPS objective are independent of inputs

full rationale

The paper's central derivation introduces learned functional perturbations applied to a deterministic MLIP, followed by end-to-end finetuning using the CRPS proper scoring rule. This construction does not reduce by definition or by the paper's equations to any previously fitted parameter or self-citation chain. The reported gains (CRPS improvements of 19-32% on N-body, Spearman lift from 0.75 to 0.84 on silica) are presented as empirical outcomes of the new training procedure rather than tautological renamings or fitted-input predictions. No self-definitional steps, uniqueness theorems imported from the same authors, or ansatz smuggling via prior work appear in the provided text. The approach is self-contained against external benchmarks and does not rely on load-bearing self-citations for its core claim.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

Abstract-only access limits visibility into exact parameter counts or background assumptions; the primary addition appears to be the perturbation mechanism itself.

free parameters (1)
  • learned perturbation parameters
    Parameters introduced to create functional perturbations; their number and initialization are not specified in the abstract.
invented entities (1)
  • learned functional perturbations (no independent evidence)
    purpose: To convert a deterministic MLIP into a probabilistic model without architectural redesign
    Introduced as the core mechanism to enable uncertainty quantification via end-to-end training.

pith-pipeline@v0.9.0 · 5767 in / 1044 out tokens · 39942 ms · 2026-05-20T04:22:16.238849+00:00 · methodology

