Uncertainty-aware Machine Learning Interatomic Potentials via Learned Functional Perturbations

Dario Coscia; David R. Wessels; Erik J. Bekkers; Maksim Zhdanov; Olga Zaghen

arxiv: 2605.19939 · v2 · pith:GDPWG2AOnew · submitted 2026-05-19 · 💻 cs.CE

Uncertainty-aware Machine Learning Interatomic Potentials via Learned Functional Perturbations

Olga Zaghen , Maksim Zhdanov , Dario Coscia , David R. Wessels , Erik J. Bekkers This is my paper

Pith reviewed 2026-05-20 04:22 UTC · model grok-4.3

classification 💻 cs.CE

keywords uncertainty quantificationmachine learning interatomic potentialsequivariant graph neural networkscontinuous ranked probability scorefunctional perturbationsout-of-distribution predictionsilicaN-body benchmark

0 comments

The pith

Machine learning interatomic potentials gain reliable uncertainty estimates by adding learned functional perturbations and training with the continuous ranked probability score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make existing machine learning interatomic potentials uncertainty-aware in a simple way that avoids ensembles or extra hyperparameters. It introduces learned functional perturbations to the deterministic output and finetunes the model end-to-end using the continuous ranked probability score as the training objective. This matters for simulations and active learning because it lets users know when predictions are likely to fail on new atomic arrangements. Tests on charged-particle systems and silica structures show the approach produces uncertainties that align better with actual errors than prior Bayesian methods.

Core claim

A deterministic MLIP becomes probabilistic when its predictions receive learned functional perturbations that are optimized jointly with the continuous ranked probability score, producing calibrated uncertainty estimates that improve correlation with true errors on out-of-distribution configurations.

What carries the argument

Learned functional perturbations, which modify the model's output function during end-to-end CRPS training to encode predictive uncertainty.

If this is right

Active learning for MLIPs can select new training structures more efficiently by using the uncertainty signal.
Molecular dynamics simulations become safer because high-uncertainty regions can trigger fallback to more expensive calculations.
Foundation models for materials can be turned uncertainty-aware through the same finetuning procedure without redesign.
The method applies equally to models trained from scratch and to large pretrained potentials.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation idea may extend to other scientific machine-learning models that currently lack built-in uncertainty.
Combining these perturbations with selective ensemble averaging could further tighten calibration on rare events.
The approach might reduce the data needed for reliable potentials by guiding data collection toward uncertain regions.

Load-bearing premise

Learned functional perturbations, when optimized with CRPS, can represent the uncertainty of atomic configurations that lie outside the training distribution.

What would settle it

A new test set of atomic configurations with errors that do not increase in line with the model's reported uncertainty, or CRPS scores that fail to beat the Bayesian baseline.

Figures

Figures reproduced from arXiv: 2605.19939 by Dario Coscia, David R. Wessels, Erik J. Bekkers, Maksim Zhdanov, Olga Zaghen.

**Figure 1.** Figure 1: N-body test performance vs. training size (mean ± std over 4 seeds, shaded). Left: MSE; Center: CRPS; Right: spread-toskill ratio SSR. P-EGNN consistently achieves the best CRPS and the SSR closest to 1 at every training size, with the calibration gap widening as n grows. 32 128 1024 2 4 6 32 128 1024 2 4 6 Training set size n 32 128 1024 0.2 0.4 0.6 0.8 1 F-MAE[meV/A] ˚ ↓ F-CRPS[meV/A] ˚ ↓ F-Spear ↑ (ide… view at source ↗

read the original abstract

Machine Learning Interatomic Potentials (MLIPs) achieve near ab initio accuracy at a fraction of the cost of quantum-mechanical simulations, yet they remain prone to silent failures on out-of-distribution configurations, making principled uncertainty quantification (UQ) essential for error-aware simulations and active learning. Existing non-ensemble UQ methods for MLIPs rely either on variational inference or on parametric distributional assumptions, both of which add architectural complexity and hyper-parameters that must be tuned per task. Inspired by recent advances in probabilistic weather forecasting, we propose a simpler alternative: turn a deterministic MLIP into a probabilistic one through learned functional perturbations and finetune it end-to-end with the Continuous Ranked Probability Score (CRPS), a proper scoring rule. We validate the approach with an equivariant GNN (P-EGNN) trained from scratch and by finetuning the foundation model the Orb-v3 for silica. On the N-body charged particle benchmark, P-EGNN improves CRPS over the state-of-the-art Bayesian MLIP method BLIP by 19-32% across all training sizes; on silica, P-Orb raises the Spearman correlation between predicted uncertainty and actual error from 0.75 (BLIP-Orb) to 0.84.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adapts functional perturbations from weather models to MLIPs, trains with CRPS, and reports clear gains over BLIP on the tested benchmarks, though OOD epistemic uncertainty capture is the open question.

read the letter

The main point is that learned functional perturbations plus end-to-end CRPS training turn a deterministic equivariant GNN into a probabilistic MLIP with lower added complexity than variational or ensemble routes, and the numbers on the N-body and silica cases look better than BLIP. On the charged-particle benchmark the CRPS drops 19-32% across training sizes; on silica the Spearman correlation between predicted uncertainty and actual error rises from 0.75 to 0.84. That is the concrete result worth noting first. The approach is new in its specific combination: taking the perturbation idea from probabilistic forecasting, applying it to P-EGNN trained from scratch and to Orb-v3 finetuning, and using CRPS as the objective instead of a parametric likelihood. The paper keeps the architectural overhead small—just extra learned perturbation parameters—and shows the method works on both a small benchmark and a foundation-model setting. Those are the parts that hold up cleanly on the reported evidence. The softer spot is the claim that the perturbations adequately represent epistemic uncertainty on truly out-of-distribution atomic configurations. The benchmarks are useful but limited; gains could still come from improved in-distribution calibration rather than reliable behavior on unseen bonding environments, defects, or extreme conditions. Without additional OOD stress tests or comparison to explicit ensemble baselines on those regimes, it is hard to judge how far the simplification travels. The citation pattern to BLIP and the weather-forecasting literature is straightforward and does not look circular. This work is for groups already running MLIPs who want a lighter-weight uncertainty option for active learning or error-aware simulations. A reader who needs practical UQ without per-task hyper-parameter tuning will get the most out of the quantitative comparisons. It is worth sending to peer review so the methods and any extra validation can be checked in detail.

Referee Report

1 major / 1 minor

Summary. The paper proposes turning deterministic ML interatomic potentials into probabilistic models by introducing learned functional perturbations and end-to-end finetuning with the Continuous Ranked Probability Score (CRPS). It reports that P-EGNN improves CRPS by 19-32% over BLIP on the N-body charged-particle benchmark across training sizes, and that P-Orb raises the Spearman correlation between predicted uncertainty and error from 0.75 (BLIP-Orb) to 0.84 on silica.

Significance. If the central results hold under full verification, the approach supplies a lower-complexity alternative to variational or ensemble UQ for MLIPs, which could simplify reliable active learning and error-aware molecular dynamics. The concrete benchmark gains (CRPS and Spearman lifts) constitute a clear, falsifiable advance worth testing on additional OOD regimes.

major comments (1)

The claim that learned functional perturbations, when finetuned with CRPS, adequately represent epistemic uncertainty on out-of-distribution atomic configurations rests on the N-body and silica results; however, the manuscript provides insufficient detail on data splits, OOD construction, and verification that the observed gains (19-32% CRPS, 0.75 to 0.84 Spearman) arise from epistemic rather than in-distribution calibration improvements.

minor comments (1)

The abstract and methods description omit explicit statements of the perturbation parameterization, the precise form of the CRPS loss, and the training protocol for P-Orb finetuning; adding these would strengthen reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential impact. We address the major comment below and will incorporate clarifications and additional details in the revised manuscript.

read point-by-point responses

Referee: The claim that learned functional perturbations, when finetuned with CRPS, adequately represent epistemic uncertainty on out-of-distribution atomic configurations rests on the N-body and silica results; however, the manuscript provides insufficient detail on data splits, OOD construction, and verification that the observed gains (19-32% CRPS, 0.75 to 0.84 Spearman) arise from epistemic rather than in-distribution calibration improvements.

Authors: We agree that greater explicitness on data splits and OOD construction will strengthen the presentation. In the revised manuscript we will add a dedicated paragraph in the Experiments section that specifies: (i) for the N-body benchmark, training configurations are generated with 5–10 particles while test sets include systems with 15–20 particles to induce controlled distributional shift; (ii) for the silica benchmark, the OOD subset is constructed from trajectories at temperatures and defect densities outside the training distribution. On the epistemic-versus-calibration question, the functional perturbations are introduced precisely to allow the model to express epistemic variability in the learned potential; CRPS training then optimizes the entire predictive distribution under this variability. The reported CRPS gains are measured on the shifted test distributions, and the Spearman improvement quantifies better ranking of actual errors by the predicted uncertainty—precisely the behavior expected when epistemic uncertainty is better captured. We will include a short discussion of this distinction together with a supplementary plot of uncertainty–error correlation stratified by in-distribution versus OOD subsets. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new perturbations and CRPS objective are independent of inputs

full rationale

The paper's central derivation introduces learned functional perturbations applied to a deterministic MLIP, followed by end-to-end finetuning using the CRPS proper scoring rule. This construction does not reduce by definition or by the paper's equations to any previously fitted parameter or self-citation chain. The reported gains (CRPS improvements of 19-32% on N-body, Spearman lift from 0.75 to 0.84 on silica) are presented as empirical outcomes of the new training procedure rather than tautological renamings or fitted-input predictions. No self-definitional steps, uniqueness theorems imported from the same authors, or ansatz smuggling via prior work appear in the provided text. The approach is self-contained against external benchmarks and does not rely on load-bearing self-citations for its core claim.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 1 invented entities

Abstract-only access limits visibility into exact parameter counts or background assumptions; the primary addition appears to be the perturbation mechanism itself.

free parameters (1)

learned perturbation parameters
Parameters introduced to create functional perturbations; their number and initialization are not specified in the abstract.

invented entities (1)

learned functional perturbations no independent evidence
purpose: To convert a deterministic MLIP into a probabilistic model without architectural redesign
Introduced as the core mechanism to enable uncertainty quantification via end-to-end training.

pith-pipeline@v0.9.0 · 5767 in / 1044 out tokens · 39942 ms · 2026-05-20T04:22:16.238849+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

turn a deterministic MLIP into a probabilistic one through learned functional perturbations and finetune it end-to-end with the Continuous Ranked Probability Score (CRPS)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.