arXiv: 2606.29717 · v1 · pith:J4PSAILTnew · submitted 2026-06-29 · ❄️ cond-mat.mtrl-sci · cs.AI · cs.LG

Optimizing Expert-Designed Crystal Graph Networks for Band-Gap Prediction with an Autonomous LLM Research Loop

Pith reviewed 2026-06-30 05:46 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci · cs.AI · cs.LG
keywords crystal graph networks · band-gap prediction · MatBench benchmark · LLM coding agent · autonomous optimization · materials machine learning · space-group embedding · message passing

The pith

An autonomous LLM coding agent built the most accurate model trained without external pretraining for crystal band-gap prediction on the MatBench benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a general-purpose coding agent can autonomously refine crystal graph networks to achieve the highest accuracy on the MatBench band-gap task among models trained without external pretraining. It outperforms all seventeen previously reported expert-designed models on a benchmark of more than 100,000 crystals. The agent reached this result by adding element-pair features to message-passing edges and incorporating crystal space-group embeddings, both of which are established techniques. The work therefore shows that LLM-driven loops can optimize expert machine-learning architectures for materials property prediction while also mapping the practical boundaries of such automation.

Core claim

On the MatBench band-gap benchmark (>100k crystals), a general-purpose coding agent autonomously built the most accurate model trained without external pretraining, ahead of all seventeen expert-designed models reported for the task. The agent reached this performance by implementing known methods: element-pair features on each message-passing edge and a crystal space-group embedding. The study both validates that LLM-agent autonomous research can optimize an expert-designed machine learning model for material property prediction and examines the limitations of the approach.
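
One way to picture the two ingredients named above is as lookup tables. The Python sketch below is an illustration under stated assumptions, not the paper's implementation: the embedding width `DIM`, the random initialization, and the helper names `pair_index`, `edge_features`, and `space_group_embedding` are all hypothetical. In a trained network the tables would be learnable parameters rather than fixed random lists.

```python
import random

NUM_ELEMENTS = 118       # atomic numbers 1..118
NUM_SPACE_GROUPS = 230   # international space-group numbers 1..230
DIM = 4                  # illustrative embedding width (assumption)
NUM_PAIRS = NUM_ELEMENTS * (NUM_ELEMENTS + 1) // 2


def pair_index(z_i: int, z_j: int) -> int:
    """Map an unordered element pair to a unique row of the pair table."""
    lo, hi = sorted((z_i - 1, z_j - 1))
    # Row-major index into the upper triangle, diagonal included.
    return lo * NUM_ELEMENTS - lo * (lo - 1) // 2 + (hi - lo)


_rng = random.Random(0)


def make_table(rows: int, dim: int) -> list:
    """Stand-in for a learnable embedding matrix (random, for illustration)."""
    return [[_rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(rows)]


pair_table = make_table(NUM_PAIRS, DIM)
space_group_table = make_table(NUM_SPACE_GROUPS, DIM)


def edge_features(z_i: int, z_j: int, distance_features: list) -> list:
    """Concatenate geometric edge features with the element-pair embedding."""
    return list(distance_features) + pair_table[pair_index(z_i, z_j)]


def space_group_embedding(space_group_number: int) -> list:
    """Look up a crystal-level embedding by space-group number (1..230)."""
    if not 1 <= space_group_number <= NUM_SPACE_GROUPS:
        raise ValueError("space-group number must be in 1..230")
    return space_group_table[space_group_number - 1]


# Example: a Ti-O edge (Z = 22 and 8) in a crystal with space group 225.
edge = edge_features(22, 8, [0.5, 0.1])
crystal = space_group_embedding(225)
print(len(edge), len(crystal))
```

The pair index is symmetric by construction, so an edge from Ti to O and one from O to Ti share the same embedding row. Whether the paper's model treats pairs as ordered or unordered is not stated on this page.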

What carries the argument

Autonomous LLM research loop that iteratively codes, trains, and evaluates crystal graph networks for band-gap regression.
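
A minimal sketch of such a loop, assuming a greedy keep-if-better rule on validation MAE. The agent's actual prompts and acceptance rule are not described on this page; `propose` stands in for the LLM agent and `train_and_evaluate` for a full training run, and both toy versions below are hypothetical stand-ins that return synthetic scores.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    number: int
    description: str
    val_mae: float
    kept: bool


def run_loop(propose, train_and_evaluate, baseline_config, n_experiments):
    """Greedy loop: keep a candidate only if it lowers validation MAE."""
    best_config = baseline_config
    best_mae = train_and_evaluate(best_config)
    history = []
    for number in range(1, n_experiments + 1):
        description, candidate = propose(best_config, history)
        mae = train_and_evaluate(candidate)
        kept = mae < best_mae
        if kept:
            best_config, best_mae = candidate, mae
        history.append(Experiment(number, description, mae, kept))
    return best_config, best_mae, history


def toy_propose(config, history):
    """Hypothetical stand-in for the agent: toggle one flag per experiment."""
    rng = random.Random(len(history))
    key = rng.choice(sorted(config))
    return f"toggle {key}", dict(config, **{key: not config[key]})


def toy_train_and_evaluate(config):
    """Synthetic score (not a real training run): each enabled flag helps."""
    return 1.0 - 0.1 * sum(config.values())


baseline = {"pair_edge_features": False, "space_group_embedding": False}
best_config, best_mae, history = run_loop(
    toy_propose, toy_train_and_evaluate, baseline, n_experiments=6
)
for exp in history:
    print(exp.number, exp.description, f"{exp.val_mae:.2f}", "kept" if exp.kept else "rejected")
```

Plotting the running minimum of `val_mae` against `number` gives the kind of best-so-far curve described in the Figure 2 caption.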

If this is right

  • The same agent loop can be applied to other MatBench tasks such as formation energy or elasticity prediction.
  • Element-pair edge features and space-group embeddings become standard additions to crystal graph architectures.
  • Autonomous coding agents reduce the human effort needed to reach state-of-the-art performance on fixed materials benchmarks.
  • Limitations identified in the agent loop will guide the design of more reliable autonomous research systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that many incremental gains in materials ML may already be reachable by systematic recombination of known components rather than novel inventions.
  • If the loop generalizes, similar agents could be deployed on private or newly collected crystal datasets without requiring a large team of domain experts.
  • The work leaves open whether the agent would discover genuinely new architectural motifs when the search space is expanded beyond current crystal-graph conventions.

Load-bearing premise

The agent's model was evaluated under identical conditions to the 17 expert models on the public benchmark, with no hidden advantages from the agent's implementation details or data handling.

What would settle it

Re-running the agent's final model on the public MatBench split and finding that its accuracy falls below at least one of the seventeen expert baselines.
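
The comparison behind this test can be sketched in a few lines of Python. This is an illustration only: it assumes the benchmark score is the mean of per-fold MAEs, the helper names are hypothetical, and every number below is a made-up placeholder rather than a real MatBench result.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """MAE in the units of the targets (eV for band gaps)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def check_claim(candidate_fold_maes, baseline_maes):
    """Return the candidate's mean MAE and the baselines it fails to beat."""
    candidate = mean(candidate_fold_maes)
    not_beaten = sorted(
        name for name, mae in baseline_maes.items() if mae <= candidate
    )
    return candidate, not_beaten


# Placeholder numbers for illustration only; NOT real MatBench results.
fold_maes = [0.15, 0.16, 0.14, 0.15, 0.15]
baselines = {"baseline_A": 0.20, "baseline_B": 0.18}

candidate, not_beaten = check_claim(fold_maes, baselines)
print(f"candidate MAE = {candidate:.3f} eV; baselines not beaten: {not_beaten}")
```

A non-empty `not_beaten` list on the official splits would contradict the headline claim.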

Figures

Figures reproduced from arXiv: 2606.29717 by Boris I. Yakobson, Chenmu Zhang.

Figure 1: Mean absolute error (eV, log scale, lower is better) on the …
Figure 2: The lowest validation MAE reached (eV, solid black line) against experiment …
Figure 3: The best model’s mean absolute error (gold stars) relative to coGN’s, on …
Figure 4: The lowest held-out MAE reached, against experiment number, for two runs on …
Original abstract

Predicting a material's properties from its structure is a central, fast-advancing problem in computational materials science. A decade of work has produced standard public benchmarks and many published machine-learning models for the task (Dunn et al., 2020). The task's fixed metric and these baselines make it a natural setting for autonomous agent research (Karpathy, 2026). On the MatBench band-gap benchmark (>100k crystals), a general-purpose coding agent autonomously built the most accurate model trained without external pretraining, ahead of all seventeen expert-designed models reported for the task. A closer analysis shows it reached this by implementing known methods: either already standard in crystal neural-network models, or borrowed from other areas of machine learning. The contributing implementations include element-pair features on each message-passing edge and a crystal space-group embedding. The work not only demonstrates that LLM-agent autonomous research can optimize an expert-designed machine learning model for material property prediction, but also investigates the limitations of such autonomous research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes an autonomous LLM-based coding agent that optimizes an expert-designed crystal graph neural network for band-gap prediction. On the MatBench benchmark (>100k crystals), the resulting model—incorporating element-pair features on message-passing edges and crystal space-group embeddings—is reported to outperform all 17 previously published expert-designed models when trained without external pretraining. The work also examines the limitations of such autonomous optimization loops.

Significance. If the performance edge is shown to arise solely from the architectural choices under identical benchmark conditions, the result would illustrate that general-purpose coding agents can autonomously discover and implement known but effective techniques from crystal networks and broader ML literature, thereby demonstrating a viable path for accelerating model development in materials informatics while also surfacing practical constraints of agent-driven research.

major comments (2)
  1. [Abstract / Results] Abstract and main results section: the headline claim that the agent-built model is 'ahead of all seventeen expert-designed models' on the public MatBench band-gap benchmark is load-bearing for the paper's central contribution, yet the manuscript supplies no explicit statement or supplementary table confirming that the identical train/test splits, normalization, missing-value handling, and evaluation script from the MatBench repository were used. Any deviation would prevent attribution of the accuracy gain to the autonomous loop rather than implementation differences.
  2. [Methods] Methods and experimental details: the abstract states that the agent 'autonomously built' the model by implementing element-pair edge features and space-group embedding, but the manuscript provides no details on the exact agent prompts, the base architecture before modification, training hyperparameters, statistical validation (e.g., multiple random seeds or cross-validation), or ablation studies isolating the contribution of each added component. These omissions make it impossible to assess whether the reported superiority is robust.
minor comments (2)
  1. [Results] The manuscript should include a dedicated table or supplementary note listing the 17 baseline models with their reported MatBench metrics for direct comparison.
  2. [Methods] Notation for the crystal graph (e.g., definition of edge features and space-group embedding) should be introduced with an equation or diagram in the methods section to improve clarity.
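
As an illustration of the kind of notation this comment asks for, one generic way to write a message-passing model with the two added components is given below. The symbols are assumptions chosen for this sketch and are not taken from the manuscript: $\phi(d_{ij})$ is a distance expansion, $P_{\{Z_i, Z_j\}}$ an embedding of the unordered element pair, $M$ a message function, $R$ a readout network, $S_g$ the embedding of space group $g$, and $\Vert$ denotes concatenation.

```latex
\begin{align}
e_{ij} &= \big[\, \phi(d_{ij}) \;\Vert\; P_{\{Z_i, Z_j\}} \,\big], \\
h_i^{(t+1)} &= h_i^{(t)} + \sum_{j \in \mathcal{N}(i)} M\big(h_i^{(t)}, h_j^{(t)}, e_{ij}\big), \\
\hat{y} &= R\Big(\Big[\, \tfrac{1}{N} \sum_{i=1}^{N} h_i^{(T)} \;\Vert\; S_{g} \,\Big]\Big).
\end{align}
```

Here $h_i^{(t)}$ is the feature vector of atom $i$ after $t$ message-passing steps, $\mathcal{N}(i)$ its neighbor set, $N$ the number of atoms in the cell, and $\hat{y}$ the predicted band gap. The manuscript's actual architecture may differ in every one of these choices.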

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the emphasis on reproducibility and methodological transparency. Both major comments identify areas where additional explicit statements and details will strengthen the manuscript; we will incorporate revisions to address them directly.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and main results section: the headline claim that the agent-built model is 'ahead of all seventeen expert-designed models' on the public MatBench band-gap benchmark is load-bearing for the paper's central contribution, yet the manuscript supplies no explicit statement or supplementary table confirming that the identical train/test splits, normalization, missing-value handling, and evaluation script from the MatBench repository were used. Any deviation would prevent attribution of the accuracy gain to the autonomous loop rather than implementation differences.

    Authors: We agree that an explicit confirmation of benchmark fidelity is required to attribute performance differences to the autonomous loop. In the revised manuscript we will add a paragraph in the Methods section stating that the official MatBench train/test splits, normalization, missing-value handling, and evaluation scripts were followed exactly as released in the MatBench repository. We will also add a supplementary table that tabulates the key benchmark parameters and confirms adherence to the public protocol. revision: yes

  2. Referee: [Methods] Methods and experimental details: the abstract states that the agent 'autonomously built' the model by implementing element-pair edge features and space-group embedding, but the manuscript provides no details on the exact agent prompts, the base architecture before modification, training hyperparameters, statistical validation (e.g., multiple random seeds or cross-validation), or ablation studies isolating the contribution of each added component. These omissions make it impossible to assess whether the reported superiority is robust.

    Authors: We acknowledge the current manuscript is missing these implementation details. The revised Methods section will be expanded to report: the precise prompts supplied to the coding agent, the base architecture prior to any modifications, all training hyperparameters, performance statistics across multiple random seeds, and ablation experiments that isolate the contribution of the element-pair edge features and the space-group embedding. These additions will enable readers to evaluate robustness directly. revision: yes
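
The ablation and multi-seed statistics promised here could be organized as a small grid. The Python sketch below is one hedged way to do it: the function names are hypothetical, and `toy_train_and_evaluate` returns synthetic scores in place of real training runs, so the printed numbers carry no scientific meaning.

```python
import random
from itertools import product
from statistics import mean, stdev

COMPONENTS = ("pair_edge_features", "space_group_embedding")


def ablation_grid(train_and_evaluate, seeds):
    """Evaluate every on/off combination of COMPONENTS across all seeds."""
    results = {}
    for flags in product((False, True), repeat=len(COMPONENTS)):
        config = dict(zip(COMPONENTS, flags))
        maes = [train_and_evaluate(config, seed) for seed in seeds]
        spread = stdev(maes) if len(maes) > 1 else 0.0
        results[flags] = (mean(maes), spread)
    return results


def toy_train_and_evaluate(config, seed):
    """Synthetic stand-in for a training run; NOT real benchmark data."""
    rng = random.Random(seed)
    return 1.0 - 0.1 * sum(config.values()) + rng.gauss(0.0, 0.01)


results = ablation_grid(toy_train_and_evaluate, seeds=[0, 1, 2])
for flags, (avg, spread) in results.items():
    label = ", ".join(f"{name}={on}" for name, on in zip(COMPONENTS, flags))
    print(f"{label}: MAE = {avg:.3f} +/- {spread:.3f}")
```

Comparing the row with both components off against the rows with one or both on isolates each component's contribution, and the spread across seeds indicates whether a difference exceeds run-to-run noise.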

Circularity Check

0 steps flagged

No circularity: empirical benchmark result on external public data

The paper's central claim is an empirical performance comparison on the public MatBench band-gap benchmark (>100k crystals), where an LLM agent autonomously produced a model outperforming 17 expert baselines. No mathematical derivation, fitted parameters, or self-referential equations are presented. The result rests on external public benchmark data and reported model performances rather than internal definitions or self-citation chains. This matches the default expectation of a self-contained empirical finding with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical ML application study with no mathematical derivations, free parameters, or new physical axioms introduced; the central claim rests entirely on benchmark performance comparison.

pith-pipeline@v0.9.1-grok · 5715 in / 1158 out tokens · 16841 ms · 2026-06-30T05:46:40.730957+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 21 canonical work pages

  1. [1]

    Graph networks as a universal machine learning framework for molecules and crystals

    Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ping Ong. Graph networks as a universal machine learning framework for molecules and crystals. Chemistry of Materials, 31(9): 3564–3572, 2019. doi:10.1021/acs.chemmater.9b01294

  2. [2]

    Atomistic line graph neural network for improved materials property predictions

    Kamal Choudhary and Brian DeCost. Atomistic line graph neural network for improved materials property predictions. npj Computational Materials, 7(1): 185, 2021. doi:10.1038/s41524-021-00650-1

  3. [3]

    Materials property prediction for limited datasets enabled by feature selection and joint learning with modnet

    Pierre-Paul De Breuck, Geoffroy Hautier, and Gian-Marco Rignanese. Materials property prediction for limited datasets enabled by feature selection and joint learning with modnet. npj Computational Materials, 7(1): 83, 2021. doi:10.1038/s41524-021-00552-2

  4. [4]

    Densegnn: universal and scalable deeper graph neural networks for high-performance property prediction in crystals and molecules

    Hongwei Du, Jiamin Wang, Jian Hui, Lanting Zhang, and Hong Wang. Densegnn: universal and scalable deeper graph neural networks for high-performance property prediction in crystals and molecules. npj Computational Materials, 10(1): 292, 2024. doi:10.1038/s41524-024-01444-x

  5. [5]

    Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm

    Alexander Dunn, Qi Wang, Alex Ganose, Daniel Dopp, and Anubhav Jain. Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. npj Computational Materials, 6(1): 138, 2020. doi:10.1038/s41524-020-00406-3

  6. [6]

    Fast and uncertainty-aware directional message passing for non-equilibrium molecules

    Johannes Gasteiger, Shankari Giri, Johannes T. Margraf, and Stephan Günnemann. Fast and uncertainty-aware directional message passing for non-equilibrium molecules, 2020. NeurIPS 2020 Machine Learning for Molecules Workshop

  7. [7]

    Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability

    Rogério Almeida Gouvêa, Pierre-Paul De Breuck, Tatiane Pretto, Gian-Marco Rignanese, and Marcos José Leite Santos. Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability. npj Computational Materials, 12(1): 67, January 2026. doi:10.1038/s41524-025-01938-2

  8. [8]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision (ECCV), pp. 646–661. Springer, 2016. doi:10.1007/978-3-319-46493-0_39

  9. [9]

    Materials informatics transformer: A language model for interpretable materials properties prediction, 2023

    Hongshuo Huang, Rishikesh Magar, Changwen Xu, and Amir Barati Farimani. Materials informatics transformer: A language model for interpretable materials properties prediction, 2023

  10. [10]

    Formula graph self-attention network for representation-domain independent materials discovery

    Achintha Ihalage and Yang Hao. Formula graph self-attention network for representation-domain independent materials discovery. Advanced Science, 9(18): 2200164, 2022. doi:10.1002/advs.202200164

  11. [11]

    Wider or deeper? scaling LLM inference-time compute with adaptive branching tree search

    Yuichi Inoue, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? scaling LLM inference-time compute with adaptive branching tree search. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  12. [12]

    LLMatDesign: Autonomous materials discovery with large language models, 2024

    Shuyi Jia, Chao Zhang, and Victor Fung. LLMatDesign: Autonomous materials discovery with large language models, 2024

  13. [13]

    AIDE: AI-driven exploration in the space of code, 2025

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code, 2025

  14. [14]

    SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

  15. [15]

    autoresearch

    Andrej Karpathy. autoresearch. https://github.com/karpathy/autoresearch, March 2026

  16. [16]

    Matini-net: Versatile material informatics research framework for feature engineering and deep neural network design

    Myeonghun Lee, Taehyun Park, and Kyoungmin Min. Matini-net: Versatile material informatics research framework for feature engineering and deep neural network design. Journal of Chemical Information and Modeling, 64(23): 8770–8783, 2024. doi:10.1021/acs.jcim.4c01676

  17. [17]

    doi: 10.1126/science.abq1158

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, P...

  18. [18]

    The ai scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024

  19. [19]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific a...

  20. [20]

    Scalable deeper graph neural networks for high-performance materials property prediction

    Sadman Sadeed Omee, Steph-Yves Louis, Nihang Fu, Lai Wei, Sourin Dey, Rongzhi Dong, Qinyang Li, and Jianjun Hu. Scalable deeper graph neural networks for high-performance materials property prediction. Patterns, 3(5): 100491, 2022. doi:10.1016/j.patter.2022.100491

  21. [21]

    TriForces: Augmenting atomistic GNNs for transferable representations

    Ali Ramlaoui, Alexandre Duval, Hannah Bull, Victor Schmidt, Hugues Talbot, Fragkiskos D. Malliaros, and Joseph Musielewicz. TriForces: Augmenting atomistic GNNs for transferable representations, May 2026. Accepted at ICML 2026

  22. [22]

    Janosh Riebesell, Rhys E. A. Goodall, Philipp Benner, Yuan Chiang, Bowen Deng, Gerbrand Ceder, Mark Asta, Alpha A. Lee, Anubhav Jain, and Kristin A. Persson. A framework to evaluate machine learning crystal stability predictions. Nature Machine Intelligence, 7(6): 836–847, June 2025. doi:10.1038/s42256-025-01055-1

  23. [23]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995): 468–475, 2024. doi:10.1038/s41586-02...

  24. [24]

    Connectivity optimized nested line graph networks for crystal structures

    Robin Ruff, Patrick Reiser, Jan Stühmer, and Pascal Friederich. Connectivity optimized nested line graph networks for crystal structures. Digital Discovery, 3(3): 594–601, 2024. doi:10.1039/d4dd00018h

  25. [25]

    SchNet – a deep learning architecture for molecules and materials

    K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller. SchNet – a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24): 241722, 2018. doi:10.1063/1.5019779

  26. [26]

    Equivariant message passing for the prediction of tensorial properties and molecular spectra

    Kristof T. Schütt, Oliver T. Unke, and Michael Gastegger. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pp. 9377–9388. PMLR, 2021

  27. [27]

    From molecules to materials: Pre-training large generalizable models for atomic property prediction

    Nima Shoghi, Adeesh Kolluru, John R. Kitchin, Zachary W. Ulissi, C. Lawrence Zitnick, and Brandon M. Wood. From molecules to materials: Pre-training large generalizable models for atomic property prediction. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=PfPnugdxup

  28. [28]

    An autonomous laboratory for the accelerated synthesis of novel materials

    Nathan J. Szymanski, Bernardus Rendy, Yuxing Fei, Rishi E. Kumar, Tanjin He, David Milsted, Matthew J. McDermott, Max Gallant, Ekin Dogus Cubuk, Amil Merchant, Haegyeom Kim, Anubhav Jain, Christopher J. Bartel, Kristin Persson, Yan Zeng, and Gerbrand Ceder. An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624(7990): ...

  29. [29]

    Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, and Yoram Bachrach

    Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Car...

  30. [30]

    Compositionally restricted attention-based network for materials property predictions

    Anthony Yu-Tung Wang, Steven K. Kauwe, Ryan J. Murdock, and Taylor D. Sparks. Compositionally restricted attention-based network for materials property predictions. npj Computational Materials, 7(1): 77, 2021. doi:10.1038/s41524-021-00545-1

  31. [31]

    Crystograph: A comprehensive predictive model for crystal material properties and the benchmark

    Hongyi Wang, Ji Sun, Jinzhe Liang, Li Zhai, Zitian Tang, Zijian Li, Wei Zhai, Xusheng Wang, Weihao Gao, and Sheng Gong. Crystograph: A comprehensive predictive model for crystal material properties and the benchmark. Battery Energy, 4(4): e70004, 2025. doi:10.1002/bte2.70004

  32. [32]

    A general-purpose machine learning framework for predicting properties of inorganic materials

    Logan Ward, Ankit Agrawal, Alok Choudhary, and Christopher Wolverton. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials, 2(1): 16028, 2016. doi:10.1038/npjcompumats.2016.28

  33. [33]

    Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties

    Tian Xie and Jeffrey C. Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical Review Letters, 120(14): 145301, 2018. doi:10.1103/PhysRevLett.120.145301

  34. [34]

    CLOUD: A scalable and physics-informed foundation model for crystal representation learning

    Changwen Xu, Shang Zhu, and Venkatasubramanian Viswanathan. CLOUD: A scalable and physics-informed foundation model for crystal representation learning. Nature Communications, 17(1): 4074, 2026. doi:10.1038/s41467-026-70467-3

  35. [35]

    The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, April 2025

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, April 2025

  36. [36]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. NeurIPS 2024