
arxiv: 2606.31660 · v1 · pith:DEKZEUJNnew · submitted 2026-06-30 · ⚛️ physics.chem-ph · cond-mat.mtrl-sci

Contrastive Regularization of Machine Learning Potentials

Pith reviewed 2026-07-01 02:59 UTC · model grok-4.3

classification ⚛️ physics.chem-ph cond-mat.mtrl-sci
keywords machine learning potentials, contrastive regularization, molecular dynamics sampling, MD17 dataset, energy-based models, Kullback-Leibler divergence, Langevin dynamics, thermodynamic distributions

The pith

Machine learning potentials trained only on pointwise errors produce wrong thermodynamic distributions even when energies are chemically accurate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine learning interatomic potentials are meant to drive molecular simulations whose averages match the equilibrium distributions of reference methods such as DFT. Standard MSE training on DFT data can reach chemical accuracy on held-out points and still let trajectories drift into spurious low-energy minima, distorting energy distributions, interatomic distances, and free-energy profiles. The paper demonstrates this failure on ethanol and aspirin from the MD17 dataset. It introduces Contrastive Regularized MSE (CRMSE), a post-training correction that adds a term derived from the Kullback-Leibler (KL) divergence between the model's implicit Boltzmann distribution and the target. The network generates its own corrective configurations via persistent Langevin chains, and training raises the energy of the unphysical ones, recovering near-quantitative agreement with DFT observables without new ab initio data or loss of force accuracy.

Core claim

Potentials trained by MSE minimization on DFT data reach chemical accuracy on test points yet fail as samplers, because the Boltzmann distribution they generate differs from the target. The proposed fix augments the loss with a contrastive term derived from the Kullback-Leibler divergence, obtained by running persistent Langevin chains that expose the network's own spurious low-energy minima. This confines trajectories to the physical basin and recovers the energy distribution, interatomic-distance distributions, and dihedral free-energy profiles to near-quantitative agreement with DFT, while preserving force accuracy and remaining effective under reduced training data.

What carries the argument

Contrastive Regularized MSE (CRMSE), which augments pointwise MSE with a contrastive penalty from the KL divergence between the potential's implicit distribution and the target, using the trained network itself as an energy-based model whose persistent Langevin chains supply the negative samples.
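
This page does not reproduce the paper's exact expression for the contrastive term. As a hedged sketch only, the textbook energy-based-model form is written below, assuming a Boltzmann model $p_\theta(x) \propto e^{-\beta E_\theta(x)}$. The symbols $\lambda$ (regularization weight), $x^-$ (negative samples drawn from the persistent chains), and $\mathcal{L}_{\mathrm{CR}}$ are notation chosen here for illustration and may differ from the paper's.

```latex
% KL divergence between the target p_data and the model p_theta, up to a theta-independent constant
\mathrm{KL}\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  = \beta\, \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[E_\theta(x)\right] + \log Z_\theta + \mathrm{const},
\qquad Z_\theta = \int e^{-\beta E_\theta(x)}\, dx

% Its gradient contrasts data configurations with model samples
\nabla_\theta \mathrm{KL}
  = \beta \left( \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\nabla_\theta E_\theta(x)\right]
  - \mathbb{E}_{x^- \sim p_\theta}\!\left[\nabla_\theta E_\theta(x^-)\right] \right)

% A surrogate loss with this gradient (negatives held fixed), added to the MSE
\mathcal{L}_{\mathrm{CR}}
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[E_\theta(x)\right]
  - \mathbb{E}_{x^- \sim p_\theta}\!\left[E_\theta(x^-)\right],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda\, \mathcal{L}_{\mathrm{CR}}
```

Minimizing $\mathcal{L}_{\mathrm{CR}}$ lowers the energy of reference configurations and raises the energy of configurations the model samples on its own, which is the qualitative behavior described above.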

If this is right

  • Trajectories stay confined to physical basins rather than drifting into unphysical minima.
  • Energy distributions, interatomic-distance histograms, and dihedral free-energy profiles match DFT to near-quantitative accuracy.
  • Force accuracy on physical configurations is preserved and energy errors remain within chemical accuracy.
  • The correction works even when the original training set is sharply reduced.
  • Distribution-level matching is a distinct requirement separate from pointwise regression accuracy.
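
To make the dihedral free-energy observable concrete, the sketch below estimates a profile from sampled angles through the standard relation F(φ) = −kT ln p(φ). It is illustrative only: the function name, the bin count, and the value of `kT` (about 0.593 kcal/mol, roughly 298 K) are assumptions made here, not settings taken from the paper.

```python
import math

def free_energy_profile(angles_deg, n_bins=36, kT=0.593):
    """Return (bin_centers, F) with F = -kT * ln p(phi), shifted so min(F) = 0.

    kT = 0.593 kcal/mol is an assumed value (roughly 298 K).
    Angles are in degrees and are wrapped into [-180, 180).
    """
    width = 360.0 / n_bins
    counts = [0] * n_bins
    for a in angles_deg:
        idx = int(((a + 180.0) % 360.0) // width)
        counts[min(idx, n_bins - 1)] += 1  # guard against float round-up at the edge
    total = len(angles_deg)
    centers, free = [], []
    for i, c in enumerate(counts):
        centers.append(-180.0 + (i + 0.5) * width)
        # Empty bins get infinite free energy (never visited by the sampler)
        free.append(-kT * math.log(c / total) if c > 0 else math.inf)
    f_min = min(free)
    return centers, [f - f_min for f in free]
```

Comparing the profile computed from model trajectories with the one computed from the reference trajectory, bin by bin, is one way to check the "near-quantitative" claim.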

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive approach could be tested on other learned potentials used as generators in simulations beyond molecular systems.
  • Benchmarks for ML potentials may need to add explicit sampling-quality tests in addition to pointwise error metrics.
  • Self-generated negative samples via dynamics could reduce the required volume of expensive ab initio data for reliable models.
  • The method implies that training objectives should be chosen to match the downstream use case of equilibrium sampling rather than isolated prediction accuracy.

Load-bearing premise

Persistent Langevin chains started from the trained potential will reliably expose the spurious low-energy minima responsible for sampling failure.
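
The premise can be illustrated with a toy sketch of persistent chains driven by an unadjusted Langevin update. Everything below is an assumption made for illustration: the 1D double-well energy stands in for the trained network, and the names `grad_energy`, `ula_step`, and `advance_chains`, as well as the step size and chain count, are hypothetical.

```python
import math
import random

def grad_energy(x):
    # Toy 1D double-well energy E(x) = (x^2 - 1)^2; a stand-in for the network's gradient.
    return 4.0 * x * (x * x - 1.0)

def ula_step(x, eta, beta, rng):
    # Unadjusted Langevin update: x <- x - eta * dE/dx + sqrt(2 * eta / beta) * noise
    return x - eta * grad_energy(x) + math.sqrt(2.0 * eta / beta) * rng.gauss(0.0, 1.0)

def advance_chains(chains, k_steps, eta=1e-3, beta=1.0, rng=None):
    """Advance every persistent chain by k_steps and return the new states."""
    rng = rng or random.Random(0)
    out = []
    for x in chains:
        for _ in range(k_steps):
            x = ula_step(x, eta, beta, rng)
        out.append(x)
    return out

# "Persistent" means the chains are not reset: the output of one call seeds the next.
chains = [0.0] * 8
for update in range(3):
    chains = advance_chains(chains, k_steps=200, rng=random.Random(update))
    # A training loop would use `chains` as negative samples at this point.
```

If the chains never leave the neighborhood of the training data, they supply no informative negatives, which is why the premise is load-bearing.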

What would settle it

Apply CRMSE to another MD17 molecule or dataset and measure whether the resulting dihedral free-energy profiles still deviate from the DFT reference by more than chemical accuracy thresholds.
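
One possible way to phrase this test in code is sketched below. It assumes the two free-energy profiles are already binned on the same grid and expressed in kcal/mol, and it uses the commonly quoted 1 kcal/mol figure for chemical accuracy. The function names are hypothetical, and the paper may define its thresholds differently.

```python
import math

CHEMICAL_ACCURACY_KCAL = 1.0  # commonly quoted threshold, in kcal/mol (an assumption here)

def max_profile_deviation(f_model, f_ref):
    """Largest |F_model - F_ref| over bins where both profiles are finite."""
    diffs = [abs(m - r) for m, r in zip(f_model, f_ref)
             if math.isfinite(m) and math.isfinite(r)]
    if not diffs:
        raise ValueError("no bins where both profiles are finite")
    return max(diffs)

def within_chemical_accuracy(f_model, f_ref, tol=CHEMICAL_ACCURACY_KCAL):
    """True if the model profile stays within tol of the reference in every shared bin."""
    return max_profile_deviation(f_model, f_ref) <= tol
```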

Figures

Figures reproduced from arXiv: 2606.31660 by Alberto Rosso, Dimitrios Tzivrailis, Eiji Kawasaki, Georgios Sotiropoulos.

Figure 1. Schematic effect of CRMSE training on the learned energy landscape.
Figure 2.
Figure 3. Representative ethanol molecule from the MD17…
Figure 5.
Figure 4. Performance of the MSE-trained GNN-LF…
Figure 6. Densities of interatomic distances for selected…
Figure 8.
Figure 7. Performance of the CRMSE post-trained model…
Figure 10.
Figure 9. Densities of interatomic distances for selected…
Figure 11. Free-energy profile along the H…
Figure 13. Free-energy profile along the ester dihedral…
Figure 14. Sensitivity of CRMSE to the regularization…
Figure 16. Log-probability of the DFT total energy for…
Figure 15. Log-probability of the DFT total energy for…
Figure 17. 3D representation of the DFT aspirin molecule.
Figure 19. Density of interatomic distances for selected…
Figure 20. the forces remain tightly aligned with the refer…
Figure 21. Density of interatomic distances for selected…
Figure 22. Performance of the MSE-trained GNN-LF…
Original abstract

Machine learning interatomic potentials are trained to predict energies and forces but built to be sampled: their purpose is to drive molecular simulations whose observables average over the equilibrium distribution the potential defines. They exemplify a broader AI problem, learned regressors deployed as generators, where pointwise accuracy does not guarantee a correct distribution. We show that potentials trained by standard Mean Squared Error (MSE) minimization on Density Functional Theory (DFT) data can reach chemical accuracy on held-out data, yet still fail as samplers: their trajectories drift into spurious low-energy minima and return thermodynamic observables that depart sharply from the reference. To correct this, we introduce Contrastive Regularized MSE (CRMSE), a post-training step that augments the MSE with a contrastive term derived from the Kullback-Leibler divergence between the potential's implicit Boltzmann distribution and the target. The network serves as its own energy-based model: persistent Langevin chains expose the configurations it drifts into and raise their energy, adding no new ab initio data. On the ethanol and aspirin molecules of the MD17 dataset, CRMSE confines the sampler to the physical basin and recovers the energy distribution, interatomic-distance distributions, and dihedral free-energy profiles to near-quantitative agreement with DFT, while preserving force accuracy and keeping energy errors within chemical accuracy; it remains effective when the training set is sharply reduced. That MSE training fails this way on MD17, one of the most widely used benchmarks, while a minimal contrastive correction repairs it suggests that reliable sampling depends less on data volume than on training the model against the distribution it produces: distribution-level training is not a refinement of regression accuracy, but a distinct requirement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that ML interatomic potentials trained by MSE on MD17 DFT data (ethanol, aspirin) reach chemical accuracy on held-out points yet fail as samplers, drifting into spurious low-energy minima and producing incorrect thermodynamic observables. It introduces CRMSE, a post-training step that augments MSE with a contrastive term derived from the KL divergence between the model's implicit Boltzmann distribution and the target DFT distribution; the network acts as its own energy-based model, with persistent Langevin chains generating negative samples whose energies are raised. On the cited molecules, CRMSE is reported to confine sampling to the physical basin, recover energy, interatomic-distance, and dihedral free-energy distributions to near-quantitative DFT agreement, preserve force accuracy within chemical error, and remain effective under reduced training sets.

Significance. If the central result holds, the work is significant because it isolates a concrete failure mode of pointwise regression when the model is used generatively and demonstrates that a minimal, data-free contrastive correction can restore distribution-level fidelity on a standard benchmark. The approach of self-generated negative samples via persistent dynamics is economical and directly targets the sampling pathology without requiring additional ab initio calculations. It underscores that reliable thermodynamic sampling imposes requirements distinct from regression accuracy, with potential implications for other learned energy-based models.

major comments (2)
  1. [Abstract / CRMSE description] Abstract and method description: the claim that persistent Langevin chains 'expose the configurations it drifts into' is load-bearing, yet no diagnostic is supplied showing that the generated configurations lie outside the DFT-supported physical basin, populate the spurious minima responsible for MSE sampling collapse, or differ systematically from training-data neighborhoods. Without such verification (e.g., overlap metrics or energy histograms of negative samples versus DFT), the observed recovery of distributions cannot be unambiguously attributed to the contrastive term rather than incidental effects.
  2. [Abstract] Abstract: quantitative recovery of energy, distance, and dihedral distributions is asserted without reported error bars, multiple independent chain runs, or robustness checks against contrastive strength, Langevin step size, or chain length. The absence of these controls leaves open whether the near-quantitative agreement is stable or sensitive to hyperparameter choices, undermining the claim that CRMSE 'remains effective when the training set is sharply reduced.'
minor comments (1)
  1. The notation and explicit functional form of the contrastive term (derived from KL) would benefit from an equation in the main text to clarify how the negative-sample energies enter the loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract / CRMSE description] Abstract and method description: the claim that persistent Langevin chains 'expose the configurations it drifts into' is load-bearing, yet no diagnostic is supplied showing that the generated configurations lie outside the DFT-supported physical basin, populate the spurious minima responsible for MSE sampling collapse, or differ systematically from training-data neighborhoods. Without such verification (e.g., overlap metrics or energy histograms of negative samples versus DFT), the observed recovery of distributions cannot be unambiguously attributed to the contrastive term rather than incidental effects.

    Authors: We agree that explicit verification of the negative samples would strengthen the attribution. In the revised manuscript we will add energy histograms of configurations sampled from the persistent Langevin chains versus the DFT reference, together with overlap metrics (e.g., Wasserstein distance on energies and RMSD distributions) relative to both the training set and the physical basin. These diagnostics will be shown for both ethanol and aspirin to confirm that the contrastive term targets the spurious minima responsible for sampling collapse. revision: yes

  2. Referee: [Abstract] Abstract: quantitative recovery of energy, distance, and dihedral distributions is asserted without reported error bars, multiple independent chain runs, or robustness checks against contrastive strength, Langevin step size, or chain length. The absence of these controls leaves open whether the near-quantitative agreement is stable or sensitive to hyperparameter choices, undermining the claim that CRMSE 'remains effective when the training set is sharply reduced.'

    Authors: We acknowledge the lack of statistical controls and robustness tests in the current version. The revised manuscript will report results from multiple independent Langevin chains with error bars (standard deviation across runs) on all distribution metrics. We will also add sensitivity plots varying contrastive weight, Langevin step size, and chain length, and will repeat the reduced-training-set experiments under these controls to demonstrate that the recovery remains stable and effective. revision: yes
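
The diagnostics promised in these responses could take roughly the following shape. This is a hedged sketch rather than the authors' planned implementation: it computes a 1D Wasserstein-1 distance between two equal-size energy samples, a two-sample Kolmogorov-Smirnov statistic as an alternative distribution distance, and a mean with standard deviation across independent runs. All function names are hypothetical.

```python
import statistics

def wasserstein_1d(a, b):
    """Exact 1D Wasserstein-1 distance between two equal-size samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    # For equal-size samples the optimal coupling pairs sorted values in order.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value <= x, then compare them.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def mean_and_std(per_run_values):
    """Mean and sample standard deviation of one metric across independent runs."""
    return statistics.mean(per_run_values), statistics.stdev(per_run_values)
```

Applied to energies of negative samples versus reference energies, a large distance before post-training and a small one afterwards would support attributing the recovery to the contrastive term. The error bar from `mean_and_std` needs at least two runs.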

Circularity Check

0 steps flagged

No significant circularity; contrastive term from external KL and standard self-sampling technique

full rationale

The paper derives the contrastive regularization from the Kullback-Leibler divergence between the model's implicit Boltzmann distribution and the target DFT distribution, an external statistical principle applied post-training. Persistent Langevin chains on the network generate negative samples to raise energies of drifted configurations; this is a standard energy-based modeling approach and does not reduce the reported recovery of energy/distance/dihedral distributions to a fitted input or self-definition by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The central claim remains independently testable against MD17 DFT benchmarks and does not collapse to renaming or statistical forcing of the input MSE fit.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption that self-generated Langevin trajectories expose the relevant failure modes and that penalizing them via KL-derived contrastive loss restores correct sampling without new data.

free parameters (1)
  • contrastive regularization strength
    Hyperparameter balancing the MSE loss against the contrastive term; its value must be chosen to achieve the reported recovery of distributions.
axioms (1)
  • domain assumption: The trained network can serve as its own energy-based model whose Boltzmann distribution can be sampled via persistent Langevin dynamics.
    Invoked when the method generates corrective configurations from the model itself rather than from external data.
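
How the single free parameter enters can be sketched as follows, under the assumption that the contrastive term is the difference between the mean model energy on reference configurations and on self-generated negatives. The exact form used in the paper is not shown on this page, and `crmse_loss`, `log_grid`, and their arguments are hypothetical names.

```python
def crmse_loss(pred, target, e_pos, e_neg, lam):
    """MSE on labelled data plus lam * (mean energy on data - mean energy on negatives).

    pred/target: predicted and reference labels; e_pos/e_neg: model energies on
    reference configurations and on chain-generated negatives (assumed inputs).
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    contrastive = sum(e_pos) / len(e_pos) - sum(e_neg) / len(e_neg)
    return mse + lam * contrastive

def log_grid(lo_exp, hi_exp, points_per_decade=1):
    """Log-spaced values from 10**lo_exp to 10**hi_exp inclusive, for a sweep over lam."""
    n = (hi_exp - lo_exp) * points_per_decade
    return [10.0 ** (lo_exp + k / points_per_decade) for k in range(n + 1)]
```

With `lam = 0` the loss reduces to plain MSE. When the negatives sit at lower energy than the data, the contrastive term is positive, so raising their energy reduces the loss.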



Reference graph

Works this paper leans on

28 extracted references · 2 canonical work pages

  1. [1]

    Robustness to reduced training data. To assess whether CRMSE remains effective when the labeled dataset is substantially reduced, we repeat the ethanol experiment starting from a training set of only 200 DFT configurations. This is well below the maximum of 1000 structures recommended for rMD17, a ceiling set because the trajectory-derived configurations ar...

  2. [2]

    Generalization to aspirin. Having established the validity of CRMSE on ethanol, we now examine whether the same correction generalises to a more complex molecule. As for ethanol, the surrogate is trained on 950 aspirin configurations, the maximum recommended for rMD17, whose trajectory-derived structures are temporally correlated so that larger subset...

  3. [3]

    Evaluation protocol. For the λ hyperparameter setting we perform the full CRMSE post-training starting from the same MSE-pretrained weights and momentum. We vary λ about the operating point used in the main text (λ = 0.015), while keeping the ULA step size η = 0.0004 and the number of steps per update K = 200 fixed at their main-text values. The results are summar...

  4. [4]

    Regularization strength λ. Fig. 14 shows the effect of λ, varied over [10^-5, 10] on a logarithmic scale. The two metrics expose a trade-off with a broad stable plateau. For small λ the contrastive term is too weak to raise the energy of the out-of-distribution configurations, and the KS statistic remains large as the sampler continues to escape the physica...

  5. [5]

    P. Hohenberg and W. Kohn, Phys. Rev. 136, B864 (1964)

  6. [6]

    W. Kohn and L. J. Sham, Phys. Rev. 140, A1133 (1965)

  7. [7]

    J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007)

  8. [8]

    K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, The Journal of Chemical Physics 148, 241722 (2018)

  9. [9]

    G. Wang, C. Wang, X. Zhang, Z. Li, J. Zhou, and Z. Sun, iScience 27, 109673 (2024)

  10. [10]

    X. Wang and M. Zhang, in Proceedings of the First Learning on Graphs Conference, Proceedings of Machine Learning Research, Vol. 198, edited by B. Rieck and R. Pascanu (PMLR, 2022) pp. 19:1–19:30

  11. [11]

    D. Ocampo, D. Posso, R. Namakian, and W. Gao, Computational Materials Science 244, 113155 (2024)

  12. [12]

    O. T. Unke, S. Chmiela, H. E. Sauceda, M. Gastegger, I. Poltavsky, K. T. Schütt, A. Tkatchenko, and K.-R. Müller, Chemical Reviews 121, 10142 (2021)

  13. [13]

    D. Wu, L. Wang, and P. Zhang, Phys. Rev. Lett. 122, 080602 (2019)

  14. [14]

    P. Huembeli, J. M. Arrazola, N. Killoran, M. Mohseni, and P. Wittek, Quantum Machine Intelligence 4, 1 (2022)

  15. [15]

    Y. Du and I. Mordatch, in Neural Information Processing Systems (2019)

  16. [16]

    L. Gagnon and G. Lajoie, Clarifying MCMC-based training of modern EBMs: Contrastive Divergence versus Maximum Likelihood (2022), arXiv:2202.12176 [cs.LG]

  17. [17]

    T. Tieleman, in Proceedings of the 25th International Conference on Machine Learning (ICML) (2008) pp. 1064–1071

  18. [18]

    M. Liu, K. Yan, B. Oztekin, and S. Ji, Graphebm: Molecular graph generation with energy-based models (2021), arXiv:2102.00546 [cs.LG]

  19. [19]

    F. Noé, S. Olsson, J. Köhler, and H. Wu, Science 365, eaaw1147 (2019)

  20. [20]

    M. S. Shell, The Journal of Chemical Physics 129, 144108 (2008)

  21. [21]

    W. G. Noid, J.-W. Chu, G. S. Ayton, V. Krishna, S. Izvekov, G. A. Voth, A. Das, and H. C. Andersen, The Journal of Chemical Physics 128, 244114 (2008)

  22. [22]

    S. Thaler and J. Zavadlav, Nature Communications 12, 6884 (2021)

  23. [23]

    B. Focassio, L. P. M. Freitas, and G. R. Schleder, ACS Applied Materials & Interfaces 17, 13111 (2025)

  24. [24]

    J. Vandermause, S. B. Torrisi, S. Batzner, Y. Xie, L. Sun, A. M. Kolpak, and B. Kozinsky, npj Computational Materials 6, 20 (2020)

  25. [25]

    C. Schran, K. Brezina, and O. Marsalek, The Journal of Chemical Physics 153, 104105 (2020)

  26. [26]

    Z. Yan, Z. Fan, and Y. Zhu, Journal of Chemical Information and Modeling 66, 1406 (2026), PMID: 41610402

  27. [27]

    A. S. Christensen and A. von Lilienfeld, Revised MD17 dataset (rMD17), Figshare dataset (2020), figshare

  28. [28]

    P. J. Rossky, J. D. Doll, and H. L. Friedman, The Journal of Chemical Physics 69, 4628 (1978)