pith. machine review for the scientific record.

arxiv: 2605.08988 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI · cs.CE

Recognition: 2 theorem links · Lean Theorem

Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CE
keywords machine learning interatomic potentials · compositional generalization · out-of-distribution evaluation · benchmark · molecular forces · computational chemistry · foundation models · molecular dynamics

The pith

Machine learning interatomic potentials show errors an order of magnitude higher on out-of-distribution molecules than on training examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark of four tasks that test whether machine learning models for interatomic potentials can generalize compositionally to molecules they have not seen before. In each task the training data is arranged so that success should be possible for any model that has learned how molecular fragments combine according to physical principles rather than memorizing specific patterns. Results across multiple state-of-the-art models, including large foundation models, show substantially elevated errors on the held-out cases. This matters because interatomic potentials are used to simulate new molecules in drug design and materials discovery; if the models only interpolate within the training distribution, their predictions for novel compounds cannot be trusted. The work therefore supplies concrete evidence that current approaches fall short of the compositional understanding required for reliable extrapolation.

Core claim

The authors construct a benchmark consisting of four tasks that require compositional generalisation. Models are trained on molecules whose fragments appear in controlled combinations and then tested on molecules that contain unseen combinations of the same fragments. The training sets are deliberately chosen so that any model capturing the underlying physical rules of fragment interaction should be able to predict forces and energies accurately on the test molecules. Empirical evaluation demonstrates that errors on these out-of-distribution examples are typically an order of magnitude larger than on in-distribution examples, even when the models have been pre-trained on millions of other molecules.
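The gap being measured here can be made concrete. A minimal sketch of how a force MAE and the OOD/ID ratio would be computed — the arrays below are synthetic stand-ins, not the paper's data, and the error scales are chosen only to mimic the reported order-of-magnitude gap:

```python
import numpy as np

def force_mae(pred, ref):
    """Mean absolute error over all force components, shape (frames, atoms, 3)."""
    return np.mean(np.abs(pred - ref))

# Toy stand-ins for model predictions and quantum-chemistry reference forces.
rng = np.random.default_rng(0)
ref_id = rng.normal(size=(10, 5, 3))
ref_ood = rng.normal(size=(10, 8, 3))
pred_id = ref_id + rng.normal(scale=0.02, size=ref_id.shape)    # small ID error
pred_ood = ref_ood + rng.normal(scale=0.2, size=ref_ood.shape)  # ~10x larger OOD error

mae_id = force_mae(pred_id, ref_id)
mae_ood = force_mae(pred_ood, ref_ood)
ratio = mae_ood / mae_id  # an "order of magnitude" gap corresponds to ratio ≈ 10
```

The benchmark's headline number is this ratio, reported per task and per model rather than over a single pooled array.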

What carries the argument

A benchmark of four tasks that isolate compositional generalisation by holding out specific fragment combinations while ensuring the training distribution covers the physical principles needed to predict their interactions.
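The holding-out scheme can be illustrated as a split over fragment combinations. The fragment names and held-out pairs below are purely illustrative, not the benchmark's actual tasks:

```python
from itertools import combinations

fragments = ["methyl", "hydroxyl", "carbonyl", "amine"]
all_pairs = list(combinations(fragments, 2))  # every 2-fragment combination

# Hold out specific combinations: each fragment still appears in training,
# but the held-out pairings are never seen together.
held_out = {("hydroxyl", "carbonyl"), ("methyl", "amine")}
train_pairs = [p for p in all_pairs if p not in held_out]

# Coverage check: every fragment occurs somewhere in the training split,
# so a model that learns per-fragment rules has what it needs to extrapolate.
train_frags = {f for pair in train_pairs for f in pair}
assert train_frags == set(fragments)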

If this is right

  • Current models primarily learn patterns tied to the specific training molecules rather than the compositional rules that determine molecular properties.
  • Pre-training on millions of molecules does not close the generalization gap for these compositional tasks.
  • Predictions for previously unseen molecules in applications such as drug design or materials discovery will carry substantially higher uncertainty.
  • New model architectures or training strategies that explicitly encode fragment composition and physical combination rules are required to improve reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark could be extended to other molecular properties such as electronic structure or reactivity to test whether the same compositional gap appears.
  • If the performance gap persists across different model families it suggests that purely data-driven scaling may not be sufficient and hybrid physics-informed architectures may be necessary.
  • Similar compositional benchmarks could be developed for related domains such as protein-ligand binding or crystal structure prediction to assess generalization limits more broadly.

Load-bearing premise

The training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles.

What would settle it

A single state-of-the-art model whose force and energy errors on the out-of-distribution test sets fall within a factor of two of its in-distribution errors would falsify the claim that the tasks are highly challenging for compositional generalisation.
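That criterion reduces to a one-line check. The factor-of-two threshold comes from the text above; the MAE values in the usage lines are placeholders:

```python
def falsifies_claim(mae_ood: float, mae_id: float, factor: float = 2.0) -> bool:
    """True if OOD error is within `factor` of ID error, i.e. the model
    generalises well enough to falsify the 'highly challenging' claim."""
    return mae_ood <= factor * mae_id

falsifies_claim(0.15, 0.10)  # within a factor of two
falsifies_claim(1.00, 0.10)  # an order-of-magnitude gap
```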

Figures

Figures reproduced from arXiv: 2605.08988 by Amir Masoud Nourollah, Irtaza Khalid, Stefano Leoni, Steven Schockaert.

Figure 1: An overview of the generalisation tasks covered by our GMD benchmark. In each task, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2: Generalisation across all four tasks. (a, c) Log-scaled ID vs. OOD MAE for forces and total energy. Marker shape encodes the task; colour encodes the model. (b, d) Per-molecule MAE for Fragment Chain Extension, showing extrapolation from training alkanes (C2–C6, light blue) to OOD alkanes (C7–C13, light red). Naming convention: ‘MACE’ denotes models trained from scratch on each task; ‘MACE-Small/Medium/Lar… view at source ↗
Figure 3: In-distribution and out-of-distribution performance of all evaluated models across the four [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4: Per-timestep prediction analysis for three representative models on the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5: Force error decomposition for selected evaluated models on the [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6: Supplementary force and energy analysis across all four GMD tasks. Panels (a), (c), and (e) [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7: Supplementary bar-plot decomposition across all four GMD tasks for the three additional [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8: Per-trajectory residual analysis for the [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9: Per-trajectory residual analysis for the [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10: Per-trajectory residual analysis for the [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11: Per-trajectory residual analysis for the [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12: Per-trajectory residual analysis comparing foundation-model variants on a representative [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13: Element-specific force error decomposition on the [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14: Element-specific force error decomposition on the [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15: Element-specific force error decomposition on the [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 15 (continued): Element-specific force error decomposition on the [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗
Figure 16: Element-specific force error decomposition on the [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
Figure 17: ID versus OOD performance on both augmented variants for all evaluated models. Squares: [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗
Figure 18: Aggregate ID and OOD performance on the augmented variants. Top row: Augmented [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗
Figure 19: Augmented Fragment Chain Extension: per-molecule MAE as a function of carbon chain [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗
Figure 20: Per-trajectory residual analysis for the [PITH_FULL_IMAGE:figures/full_fig_p033_20.png] view at source ↗
Figure 21: Element-specific force error decomposition on the [PITH_FULL_IMAGE:figures/full_fig_p034_21.png] view at source ↗
Figure 22: Foundation-model fine-tuning regime: ID versus OOD performance across the four [PITH_FULL_IMAGE:figures/full_fig_p037_22.png] view at source ↗
Figure 23: Performance of five from-scratch MLIP architectures on the Fragment Chain Extension [PITH_FULL_IMAGE:figures/full_fig_p043_23.png] view at source ↗
Figure 24: 2D structured molecules for generalisation tasks covered by our GMD benchmark. In each [PITH_FULL_IMAGE:figures/full_fig_p048_24.png] view at source ↗
Figure 25: 2D structured molecules for the auxiliary tasks in our GMD benchmark. For [PITH_FULL_IMAGE:figures/full_fig_p049_25.png] view at source ↗
Original abstract

Machine Learning Interatomic Potentials play a fundamental role in computational chemistry and materials science, enabling applications from molecular dynamics simulations to drug design and materials discovery. While recent approaches can estimate inter-atomic forces with high precision, it remains unclear to what extent they can generalise to previously unseen molecules. Do they learn the compositional structure of chemistry, capturing how molecular fragments and their combinations determine properties, or do they primarily learn to interpolate patterns that are specific to the training examples? To address this question, we propose a benchmark consisting of four tasks that require some form of compositional generalisation. In each task, models are tested on molecules that were unseen during training, but the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles. Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples, even when using foundation models that have been pre-trained on millions of molecules.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a benchmark of four tasks to evaluate whether machine learning interatomic potentials (MLIPs) learn compositional structure in chemistry or merely interpolate training patterns. The central empirical claim is that state-of-the-art models, including foundation models pre-trained on millions of molecules, exhibit out-of-distribution (OOD) errors that are often an order of magnitude higher than in-distribution (ID) errors on these tasks, where the training data is asserted to allow generalization via underlying physical principles.

Significance. If the tasks are verifiably solvable by models that correctly extract physical composition rules, the benchmark would be a useful contribution for diagnosing generalization failures in MLIPs and guiding development of more principle-based models. The work is empirical and provides direct performance measurements without circular derivations.

major comments (2)
  1. [Abstract and task-construction sections] The claim that 'the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles' is asserted without supporting evidence, such as an oracle physics-based baseline (e.g., a fixed force field or fragment-additive model), a formal argument that every test-molecule property is inferable from training atom-type/context combinations via known physical rules, or verification that no non-local effects or missing interaction types are required. This assumption is load-bearing for attributing the observed OOD/ID gap specifically to compositional-generalization failure rather than inherent task infeasibility.
  2. [Empirical-analysis section] The reported 'order of magnitude higher' OOD errors lack accompanying details on exact metrics, error-bar reporting, statistical controls, and explicit checks that the four tasks permit physical-principle-based generalization; without these, the central claim remains vulnerable to post-hoc task-design artifacts.
minor comments (2)
  1. [Figures and notation] Clarify notation for ID/OOD splits and ensure all figures include explicit legends distinguishing the four tasks and reporting both mean errors and variability.
  2. [Related work] Add a short related-work paragraph contrasting the proposed tasks with existing molecular generalization benchmarks to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and have revised the manuscript accordingly to strengthen the justification of the benchmark and the reporting of results.

Point-by-point responses
  1. Referee: [Abstract and task-construction sections] The claim that 'the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles' is asserted without supporting evidence, such as an oracle physics-based baseline (e.g., a fixed force field or fragment-additive model), a formal argument that every test-molecule property is inferable from training atom-type/context combinations via known physical rules, or verification that no non-local effects or missing interaction types are required. This assumption is load-bearing for attributing the observed OOD/ID gap specifically to compositional-generalization failure rather than inherent task infeasibility.

    Authors: We agree that the load-bearing assumption requires more explicit support. In the revised manuscript we have added, for each of the four tasks, a concise argument grounded in standard chemical principles (e.g., fragment additivity for energies and local-environment dependence for forces) together with a simple oracle baseline that implements a fragment-additive model using only quantities observable in the training set. This baseline recovers low error on the OOD test sets, indicating that the tasks are solvable once the relevant compositional rules are known. We also note that the tasks were deliberately constructed to avoid non-local or many-body effects outside the scope of the training atom-type and context combinations; any remaining higher-order interactions are negligible for the properties and molecular sizes considered. revision: yes
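The fragment-additive oracle described in this response can be sketched as a least-squares fit of per-fragment energy contributions; fragment counts and energies below are invented for illustration only:

```python
import numpy as np

# Rows: training molecules; columns: counts of each fragment type.
train_counts = np.array([[2, 1, 0],
                         [1, 0, 2],
                         [0, 2, 1],
                         [3, 1, 1]], dtype=float)
true_contrib = np.array([-40.0, -75.0, -110.0])  # per-fragment energies (arbitrary units)
train_energy = train_counts @ true_contrib       # energies of the training molecules

# Fit per-fragment contributions from training molecules only.
contrib, *_ = np.linalg.lstsq(train_counts, train_energy, rcond=None)

# Predict an OOD molecule with an unseen combination of the same fragments;
# when the target really is fragment-additive, the oracle extrapolates exactly.
ood_counts = np.array([1.0, 2.0, 2.0])
pred = ood_counts @ contrib
```

If such an oracle attains low OOD error while learned MLIPs do not, task infeasibility is ruled out as the explanation for the gap, which is exactly the role the rebuttal assigns to it.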

  2. Referee: [Empirical-analysis section] The reported 'order of magnitude higher' OOD errors lack accompanying details on exact metrics, error-bar reporting, statistical controls, and explicit checks that the four tasks permit physical-principle-based generalization; without these, the central claim remains vulnerable to post-hoc task-design artifacts.

    Authors: We have expanded the empirical-analysis section with the precise definitions of all reported metrics (MAE on energies and forces, with units), standard deviations computed over five independent training runs shown as error bars, and two-sided t-tests confirming that the OOD–ID gaps are statistically significant at p < 0.01 for every model and task. In addition, we now include a short verification subsection that cross-references each task to the physical principle it tests and confirms that the oracle baseline (described above) succeeds, thereby ruling out task infeasibility as the source of the observed gap. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

Full rationale

This paper is a purely empirical benchmark study proposing four compositional generalization tasks for ML interatomic potentials and reporting measured error gaps between in-distribution and out-of-distribution examples. No derivations, equations, fitted parameters, or predictions appear in the provided text. The design statement that 'the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles' is a methodological premise about task construction, not a quantity derived from or equivalent to the reported performance results. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The central findings are direct empirical observations rather than quantities that reduce to the inputs by construction, satisfying the criteria for a self-contained benchmark without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper with no mathematical derivations, free parameters, axioms, or invented physical entities; the central claim rests on the design of the four tasks and the observed performance differences.

pith-pipeline@v0.9.0 · 5494 in / 1142 out tokens · 37937 ms · 2026-05-12T02:12:36.865679+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

