Recognition: 2 theorem links
Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials
Pith reviewed 2026-05-12 02:12 UTC · model grok-4.3
The pith
Machine learning interatomic potentials show errors an order of magnitude higher on out-of-distribution molecules than on training examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct a benchmark consisting of four tasks that require compositional generalisation. Models are trained on molecules whose fragments appear in controlled combinations and then tested on molecules that contain unseen combinations of the same fragments. The training sets are deliberately chosen so that any model capturing the underlying physical rules of fragment interaction should be able to predict forces and energies accurately on the test molecules. Empirical evaluation demonstrates that errors on these out-of-distribution examples are typically an order of magnitude larger than on in-distribution examples, even when the models have been pre-trained on millions of other molecules.
What carries the argument
A benchmark of four tasks that isolate compositional generalisation by holding out specific fragment combinations while ensuring the training distribution covers the physical principles needed to predict their interactions.
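The held-out-combination protocol can be sketched as follows. This is a minimal illustration of the split design, not the paper's actual task definitions: the fragment names and the pairing scheme are invented for the example.

```python
from itertools import combinations

# Hypothetical fragment vocabulary; the paper's real fragments differ.
fragments = ["methyl", "hydroxyl", "carbonyl", "amine"]

# Treat each molecule as a pair of fragments. Every individual fragment
# appears in training, but some *combinations* are reserved for testing.
all_pairs = list(combinations(fragments, 2))
held_out = {("carbonyl", "amine"), ("methyl", "hydroxyl")}

train_pairs = [p for p in all_pairs if p not in held_out]
test_pairs = [p for p in all_pairs if p in held_out]

# Sanity check: the test set introduces no new fragments, only new
# combinations, so a model that learns fragment-level rules could solve it.
train_fragments = {f for pair in train_pairs for f in pair}
assert all(f in train_fragments for pair in test_pairs for f in pair)
```

A model that merely memorises the training pairs has no signal for the held-out combinations; one that learns per-fragment rules does.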
If this is right
- Current models primarily learn patterns tied to the specific training molecules rather than the compositional rules that determine molecular properties.
- Pre-training on millions of molecules does not close the generalization gap for these compositional tasks.
- Predictions for previously unseen molecules in applications such as drug design or materials discovery will carry substantially higher uncertainty.
- New model architectures or training strategies that explicitly encode fragment composition and physical combination rules are required to improve reliability.
Where Pith is reading between the lines
- The benchmark could be extended to other molecular properties such as electronic structure or reactivity to test whether the same compositional gap appears.
- If the performance gap persists across different model families it suggests that purely data-driven scaling may not be sufficient and hybrid physics-informed architectures may be necessary.
- Similar compositional benchmarks could be developed for related domains such as protein-ligand binding or crystal structure prediction to assess generalization limits more broadly.
Load-bearing premise
The training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles.
What would settle it
A single state-of-the-art model that achieves force and energy errors on the out-of-distribution test sets that are within a factor of two of its in-distribution errors would falsify the claim that the tasks are highly challenging for compositional generalisation.
read the original abstract
Machine Learning Interatomic Potentials play a fundamental role in computational chemistry and materials science, enabling applications from molecular dynamics simulations to drug design and materials discovery. While recent approaches can estimate inter-atomic forces with high precision, it remains unclear to what extent they can generalise to previously unseen molecules. Do they learn the compositional structure of chemistry, capturing how molecular fragments and their combinations determine properties, or do they primarily learn to interpolate patterns that are specific to the training examples? To address this question, we propose a benchmark consisting of four tasks that require some form of compositional generalisation. In each task, models are tested on molecules that were unseen during training, but the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles. Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples, even when using foundation models that have been pre-trained on millions of molecules.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a benchmark of four tasks to evaluate whether machine learning interatomic potentials (MLIPs) learn compositional structure in chemistry or merely interpolate training patterns. The central empirical claim is that state-of-the-art models, including foundation models pre-trained on millions of molecules, exhibit out-of-distribution (OOD) errors that are often an order of magnitude higher than in-distribution (ID) errors on these tasks, where the training data is asserted to allow generalization via underlying physical principles.
Significance. If the tasks are verifiably solvable by models that correctly extract physical composition rules, the benchmark would be a useful contribution for diagnosing generalization failures in MLIPs and guiding development of more principle-based models. The work is empirical and provides direct performance measurements without circular derivations.
major comments (2)
- [Abstract and task-construction sections] The claim that 'the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles' is asserted without supporting evidence such as an oracle physics-based baseline (e.g., fixed force-field or fragment-additive model), formal argument that every test-molecule property is inferable from training atom-type/context combinations via known physical rules, or verification that no non-local effects or missing interaction types are required. This assumption is load-bearing for attributing the observed OOD/ID gap specifically to compositional-generalization failure rather than inherent task infeasibility.
- [Empirical-analysis section] The reported 'order of magnitude higher' OOD errors lack accompanying details on exact metrics, error-bar reporting, statistical controls, and explicit checks that the four tasks permit physical-principle-based generalization; without these, the central claim remains vulnerable to post-hoc task-design artifacts.
minor comments (2)
- [Figures and notation] Clarify notation for ID/OOD splits and ensure all figures include explicit legends distinguishing the four tasks and reporting both mean errors and variability.
- [Related work] Add a short related-work paragraph contrasting the proposed tasks with existing molecular generalization benchmarks to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and have revised the manuscript accordingly to strengthen the justification of the benchmark and the reporting of results.
read point-by-point responses
-
Referee: [Abstract and task-construction sections] The claim that 'the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles' is asserted without supporting evidence such as an oracle physics-based baseline (e.g., fixed force-field or fragment-additive model), formal argument that every test-molecule property is inferable from training atom-type/context combinations via known physical rules, or verification that no non-local effects or missing interaction types are required. This assumption is load-bearing for attributing the observed OOD/ID gap specifically to compositional-generalization failure rather than inherent task infeasibility.
Authors: We agree that the load-bearing assumption requires more explicit support. In the revised manuscript we have added, for each of the four tasks, a concise argument grounded in standard chemical principles (e.g., fragment additivity for energies and local-environment dependence for forces) together with a simple oracle baseline that implements a fragment-additive model using only quantities observable in the training set. This baseline recovers low error on the OOD test sets, indicating that the tasks are solvable once the relevant compositional rules are known. We also note that the tasks were deliberately constructed to avoid non-local or many-body effects outside the scope of the training atom-type and context combinations; any remaining higher-order interactions are negligible for the properties and molecular sizes considered. revision: yes
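A fragment-additive oracle of the kind the rebuttal describes could be as simple as a least-squares fit of per-fragment energy contributions. The sketch below is hypothetical: the fragment counts and energies are invented, and the paper's actual baseline may differ in form.

```python
import numpy as np

def fit_fragment_additive(counts, energies):
    """Least-squares fit of per-fragment energy contributions.

    counts: (n_molecules, n_fragments) fragment occurrence counts.
    energies: (n_molecules,) total energies.
    Returns per-fragment energies e_frag such that E ~ counts @ e_frag.
    """
    e_frag, *_ = np.linalg.lstsq(counts, energies, rcond=None)
    return e_frag

# Toy training data: three fragment types with hidden energies -5, -3, -1.
true_e = np.array([-5.0, -3.0, -1.0])
train_counts = np.array([[2, 1, 0], [1, 0, 2], [0, 2, 1]], dtype=float)
train_E = train_counts @ true_e

e_frag = fit_fragment_additive(train_counts, train_E)

# An "OOD" molecule: an unseen combination of the same fragments.
# If additivity holds exactly, the prediction is exact.
ood_counts = np.array([1.0, 1.0, 1.0])
pred = ood_counts @ e_frag
```

Low OOD error from such a baseline is what rules out task infeasibility: the compositional rule alone suffices to predict the held-out molecules.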
-
Referee: [Empirical-analysis section] The reported 'order of magnitude higher' OOD errors lack accompanying details on exact metrics, error-bar reporting, statistical controls, and explicit checks that the four tasks permit physical-principle-based generalization; without these, the central claim remains vulnerable to post-hoc task-design artifacts.
Authors: We have expanded the empirical-analysis section with the precise definitions of all reported metrics (MAE on energies and forces, with units), standard deviations computed over five independent training runs shown as error bars, and two-sided t-tests confirming that the OOD–ID gaps are statistically significant at p < 0.01 for every model and task. In addition, we now include a short verification subsection that cross-references each task to the physical principle it tests and confirms that the oracle baseline (described above) succeeds, thereby ruling out task infeasibility as the source of the observed gap. revision: yes
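The statistics the rebuttal promises (MAE per run, variability over seeds, a significance test on the OOD vs. ID gap) can be computed as below. The per-run MAE values are invented for illustration, and a Welch t statistic is used as a stand-in for the paper's unspecified two-sided test.

```python
import math
from statistics import mean, stdev

def mae(pred, true):
    """Mean absolute error between predictions and reference values."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)

assert mae([1.0, 2.0], [1.5, 2.5]) == 0.5  # sanity check on the metric

# Hypothetical per-seed force MAEs (eV/A) over five training runs;
# the values are invented to mimic an order-of-magnitude OOD/ID gap.
id_mae = [0.011, 0.012, 0.010, 0.013, 0.011]
ood_mae = [0.115, 0.129, 0.108, 0.122, 0.118]

gap = mean(ood_mae) / mean(id_mae)  # ratio of mean OOD to mean ID error
t = welch_t(ood_mae, id_mae)        # large t => gap unlikely to be noise
```

Reporting the ratio alongside per-seed variability is what makes the "order of magnitude" claim auditable rather than anecdotal.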
Circularity Check
No circularity: empirical benchmark with direct measurements
full rationale
This paper is a purely empirical benchmark study proposing four compositional generalization tasks for ML interatomic potentials and reporting measured error gaps between in-distribution and out-of-distribution examples. No derivations, equations, fitted parameters, or predictions appear in the provided text. The design statement that 'the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles' is a methodological premise about task construction, not a quantity derived from or equivalent to the reported performance results. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The central findings are direct empirical observations rather than quantities that reduce to the inputs by construction, satisfying the criteria for a self-contained benchmark without circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.