INCARBench: A Benchmark for Scientific Configuration in VASP INCAR by Large Language Models

Baishun Yang; Bin Shao; Jixiang Li; Weichao Wang; Xinyue Zhang; Zhiyang Liu

arxiv: 2606.23571 · v2 · pith:WKDEFPT6new · submitted 2026-06-22 · ❄️ cond-mat.mtrl-sci


Pith reviewed 2026-06-26 07:06 UTC · model grok-4.3

classification: cond-mat.mtrl-sci

keywords: INCARBench, VASP, LLM configuration, INCAR, DFT+U, task-critical correctness, benchmark, materials simulation

The pith

INCARBench shows that high semantic accuracy in LLM VASP configurations does not ensure scientific validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces INCARBench to test large language models on generating and repairing VASP INCAR input files for first-principles calculations. It evaluates 19 model setups across generation and repair tasks using metrics for semantic accuracy, policy accuracy, and task-critical correctness. Several models reach strong scores on the first two metrics yet show much lower task-critical correctness. Failures cluster around physically coupled choices such as DFT+U, magnetism, and correlated materials. Repair results indicate that fixing bad settings and preserving already-correct ones are separate skills, with preservation proving especially difficult.

Core claim

Current frontier LLMs can produce VASP INCAR files that satisfy semantic and policy checks yet still fail to meet the stricter standard of task-critical correctness required for scientifically valid simulations. Errors concentrate in settings where multiple physical constraints interact, such as DFT+U combined with magnetism in correlated materials. Repair tasks further separate the ability to correct errors from the ability to leave valid parameters untouched, with the latter remaining a persistent weakness.

What carries the argument

The argument rests on the INCARBench benchmark: configuration generation and repair tasks scored by semantic accuracy, policy accuracy, and task-critical correctness.
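
A minimal sketch of how a parameter-level score and a stricter coupled check can diverge. The definitions here are assumptions made for illustration, not the paper's scoring rules: `semantic_accuracy` is taken as the fraction of reference tags reproduced exactly, and `task_critical_correct` requires every tag in a hypothetical critical set to match. The parser is deliberately simplified, the function names are invented, and the INCAR values are illustrative rather than recommended settings.

```python
def parse_incar(text):
    """Parse simple 'TAG = value' lines; '#' and '!' start comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split("!", 1)[0].strip()
        if "=" in line:
            tag, value = line.split("=", 1)
            # Normalise case and whitespace so '1e-6' and '1E-6' compare equal.
            params[tag.strip().upper()] = " ".join(value.split()).upper()
    return params


def semantic_accuracy(pred, ref):
    """Assumed definition: fraction of reference tags matched exactly."""
    return sum(pred.get(tag) == value for tag, value in ref.items()) / len(ref)


def task_critical_correct(pred, ref, critical_tags):
    """Assumed definition: every tag in the critical set must match."""
    return all(pred.get(tag) == ref[tag] for tag in critical_tags)


# Illustrative reference for a spin-polarised DFT+U run (two species).
REFERENCE = parse_incar("""
ENCUT = 520
ISMEAR = 0
SIGMA = 0.05
EDIFF = 1E-6
ISPIN = 2
MAGMOM = 2*5.0 2*0.0
LDAU = .TRUE.
LDAUTYPE = 2
LDAUL = 2 -1
LDAUU = 5.3 0.0
""")

# Hypothetical model output: identical except for one coupled +U tag.
GENERATED = dict(REFERENCE, LDAUL="-1 -1")

# Hypothetical critical set for this task.
CRITICAL = ("ISPIN", "MAGMOM", "LDAU", "LDAUL", "LDAUU")

print(semantic_accuracy(GENERATED, REFERENCE))
print(task_critical_correct(GENERATED, REFERENCE, CRITICAL))
```

In this toy case nine of the ten tags match, so the parameter-level score is 0.9, while the critical check fails on the single wrong `LDAUL` entry.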

If this is right

Task-critical correctness is a stricter and distinct requirement from semantic or policy accuracy.
Errors are concentrated in physically coupled parameter sets involving DFT+U, magnetism, and correlated materials.
Correcting incorrect settings and preserving already-valid configurations are separate capabilities.
Scientific configuration for computational materials science can be treated as a measurable LLM capability.
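
The separation between correcting and preserving can be made concrete with two rates computed over disjoint sets of tags. This is a sketch under assumed definitions, not the benchmark's actual repair score: `fix_rate` covers tags that were wrong in the broken input, `preservation_rate` covers tags that were already correct, and `repair_scores` is a hypothetical helper.

```python
def repair_scores(broken, repaired, reference):
    """Return (fix_rate, preservation_rate) for one repair attempt.

    Inputs are dicts mapping INCAR tag -> value string.
    """
    wrong = [t for t in reference if broken.get(t) != reference[t]]
    valid = [t for t in reference if broken.get(t) == reference[t]]

    fixed = sum(repaired.get(t) == reference[t] for t in wrong)
    kept = sum(repaired.get(t) == reference[t] for t in valid)

    # With nothing to fix (or nothing to preserve) the rate is trivially 1.0.
    fix_rate = fixed / len(wrong) if wrong else 1.0
    preservation_rate = kept / len(valid) if valid else 1.0
    return fix_rate, preservation_rate


# Illustrative values only.
reference = {"ISPIN": "2", "LDAU": ".TRUE.", "ENCUT": "520", "ISMEAR": "0"}
broken = dict(reference, ISPIN="1")    # one injected error
repaired = dict(reference, ENCUT="400")  # error fixed, but a valid tag was altered

fix_rate, preservation_rate = repair_scores(broken, repaired, reference)
print(fix_rate, preservation_rate)
```

Here the injected error is corrected (fix rate 1.0), yet one of the three originally valid tags was changed, so preservation is 2/3. A single combined score would hide that difference.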

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may require additional training signals that enforce simultaneous satisfaction of multiple physical constraints rather than isolated parameter rules.
Extending the benchmark to full workflow validation, such as checking whether generated inputs produce stable convergence and expected physical properties, would test real-world utility.
The gap between parameter-level correctness and scientific validity could be addressed by coupling LLMs with lightweight physics checkers during generation.
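
As an illustration of what a lightweight physics checker might look like, the sketch below tests a few coupled constraints among `ISPIN`, `MAGMOM`, and the `LDAU*` tags. The rules are simplified heuristics chosen for this example and are not the benchmark's rule set; `check_coupled_constraints` and `n_species` are hypothetical names, and the per-species count check does not expand `n*value` shorthand.

```python
def is_true(value):
    """Interpret common spellings of a Fortran-style logical as True."""
    return value is not None and value.strip().upper() in {".TRUE.", "T", "TRUE"}


def check_coupled_constraints(params, n_species):
    """Return a list of human-readable issues for a parsed INCAR dict."""
    issues = []

    # MAGMOM only matters for spin-polarised or noncollinear runs.
    ispin = params.get("ISPIN", "1").strip()
    noncollinear = is_true(params.get("LNONCOLLINEAR"))
    if "MAGMOM" in params and ispin != "2" and not noncollinear:
        issues.append("MAGMOM set without ISPIN = 2 or LNONCOLLINEAR")

    # +U tags must be switched on and given per species.
    u_tags = ("LDAUL", "LDAUU")
    if is_true(params.get("LDAU")):
        for tag in u_tags:
            if tag not in params:
                issues.append(f"LDAU enabled but {tag} missing")
            elif len(params[tag].split()) != n_species:
                issues.append(f"{tag} needs one entry per species")
    elif any(tag in params for tag in u_tags):
        issues.append("LDAUL/LDAUU given but LDAU is not enabled")

    return issues


# Illustrative inputs for a two-species cell.
bad = {"ISPIN": "1", "MAGMOM": "5.0 0.0", "LDAU": ".TRUE.", "LDAUL": "2 -1"}
good = {"ISPIN": "2", "MAGMOM": "5.0 0.0", "LDAU": ".TRUE.",
        "LDAUL": "2 -1", "LDAUU": "5.3 0.0"}

for issue in check_coupled_constraints(bad, n_species=2):
    print(issue)
```

Each tag in `bad` is individually well-formed; the issues only appear when the tags are read together, which is the kind of coupling the paper reports models struggling with.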

Load-bearing premise

The chosen generation and repair tasks plus the three accuracy metrics are enough to determine whether a configuration is scientifically valid in actual VASP workflows.

What would settle it

Run the same LLM-generated INCAR files through real VASP calculations on a set of DFT+U magnetic materials and compare outcomes against expert-validated reference results to check whether high benchmark scores predict correct physical outputs.

Figures

Figures reproduced from arXiv: 2606.23571 by Baishun Yang, Bin Shao, Jixiang Li, Weichao Wang, Xinyue Zhang, Zhiyang Liu.

**Figure 2.** Construction of the INCARBench benchmark. The generation task contains 192 ...

**Figure 3.** Generation performance on INCARBench. (a) Overall generation score. The broad ...

**Figure 4.** Repair performance on INCARBench. (a) Overall repair score ranked by aggregate ...

**Figure 5.** Material–challenge failure landscape. Cell colors indicate the mean generation score ...

Original abstract

Large language models (LLMs) are increasingly being integrated into first-principles computational workflows, yet their ability to configure scientific calculations remains poorly understood. Here, we introduce INCARBench, a benchmark for evaluating LLMs on input configuration for the Vienna Ab initio Simulation Package (VASP) through both configuration generation and repair tasks. Evaluating 19 model configurations reveals substantial capability differences among current frontier models. While several models achieve high semantic and policy accuracy, task-critical correctness remains substantially lower, demonstrating that parameter-level correctness does not necessarily imply scientifically valid configurations. Failure analysis shows that errors concentrate in physically coupled settings involving DFT+$U$, magnetism, and correlated materials, where multiple constraints must be satisfied simultaneously. Repair evaluation further reveals that correcting incorrect settings and preserving already-valid configurations are distinct capabilities, with configuration preservation remaining a major challenge. These findings establish scientific configuration as a measurable capability of large language models and provide a foundation for developing more reliable AI systems for computational materials science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

INCARBench is a targeted new benchmark for LLM VASP INCAR generation and repair that usefully separates semantic/policy accuracy from task-critical correctness, though the latter metric needs clearer external checks.

The main takeaway is that this paper introduces INCARBench to test LLMs on creating and fixing VASP INCAR files, and shows that high scores on individual parameters and policy rules do not guarantee configurations that satisfy coupled physical constraints like DFT+U with magnetism.

What stands out as new is the focused benchmark on both generation and repair tasks, plus the evaluation across 19 model setups. The split into semantic accuracy, policy accuracy, and task-critical correctness, along with the note that repair requires both fixing bad settings and preserving good ones, gives a practical way to measure progress in this narrow but real workflow step.

The paper does well at highlighting where errors cluster—in settings with multiple interacting constraints—and at framing scientific configuration as something measurable rather than just prompt engineering.

The soft spot is the task-critical correctness definition. If the rules for it overlap with the policy guidelines or come from an internal checklist without separate validation like actual VASP runs or expert sign-off, the reported gap between metrics could partly reflect how the scores were built rather than a pure capability limit. Dataset details and rubric construction are not visible in the abstract, which leaves the representativeness of the test cases open.

This is for people working on AI tools for computational materials who need a concrete testbed for INCAR handling. A reader focused on LLM reliability in DFT workflows would find the task design and failure patterns worth looking at. It deserves peer review because the benchmark idea is concrete and the problem it targets matters, even if the evaluation would benefit from more on metric grounding.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces INCARBench, a benchmark for evaluating LLMs on VASP INCAR configuration generation and repair tasks. It assesses 19 model configurations using semantic accuracy, policy accuracy, and task-critical correctness metrics, reporting that high performance on the first two does not guarantee the third, with errors concentrating in physically coupled regimes (DFT+U, magnetism, correlated materials). Repair tasks further show that error correction and valid-configuration preservation are distinct skills, with the latter remaining challenging.

Significance. If the metrics prove robust and non-circular, the work is significant for establishing a measurable, domain-specific capability gap in LLMs for computational materials science workflows. It supplies a concrete benchmark with failure-mode analysis that can guide targeted improvements, and the separation of generation versus repair tasks plus the emphasis on coupled-parameter constraints represent a useful empirical contribution beyond generic LLM evaluations.

major comments (3)

[Methods] Methods (task-critical correctness definition): The manuscript must explicitly document how the rules underlying task-critical correctness are constructed and whether they are derived independently of the policy guidelines used for policy accuracy. Without this, the central claim that parameter-level correctness does not imply scientific validity risks being partly definitional rather than an empirical demonstration of LLM shortcomings in coupled regimes.
[Results] Results (failure analysis): The statement that errors concentrate in DFT+U, magnetism, and correlated materials lacks quantitative support such as the fraction of test cases involving these regimes, the per-regime error rates, or concrete examples of coupled constraints that were violated. This detail is load-bearing for the claim that failures are regime-specific rather than uniformly distributed.
[Evaluation] Evaluation protocol: No mention is made of external grounding for task-critical correctness (e.g., expert physicist review of a sample of outputs or execution of generated INCAR files in VASP to check for runtime or convergence issues). This absence weakens the assertion that the metric captures scientific validity beyond internal checklist compliance.
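
The per-regime breakdown requested in the second major comment could be tabulated along these lines. This is a sketch only: the record format and the `per_regime_error_rates` helper are assumptions, and the records are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def per_regime_error_rates(records):
    """Map each regime label to its failure fraction.

    `records` is an iterable of (regimes, passed) pairs, where `regimes`
    is a set of labels for one test case and `passed` is a bool for the
    task-critical check. A case with several labels counts in each.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for regimes, passed in records:
        for regime in regimes:
            totals[regime] += 1
            if not passed:
                failures[regime] += 1
    return {regime: failures[regime] / totals[regime] for regime in totals}


# Placeholder records for illustration; NOT data from the benchmark.
toy_records = [
    ({"dft+u", "magnetism"}, False),
    ({"magnetism"}, True),
    ({"plain"}, True),
    ({"dft+u"}, False),
    ({"plain"}, True),
]

rates = per_regime_error_rates(toy_records)
for regime in sorted(rates):
    print(regime, rates[regime])
```

Reporting the totals alongside the rates would also answer the referee's question about what fraction of test cases falls in each regime.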

minor comments (2)

[Abstract] Abstract: Include the total number of test cases or INCAR instances used in the benchmark to give readers immediate scale context.
Notation: Ensure consistent use of “task-critical correctness” versus any shorthand throughout; minor inconsistencies in abbreviation appear in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below, indicating planned revisions where appropriate. All changes will be incorporated in a revised manuscript.

Referee: [Methods] Methods (task-critical correctness definition): The manuscript must explicitly document how the rules underlying task-critical correctness are constructed and whether they are derived independently of the policy guidelines used for policy accuracy. Without this, the central claim that parameter-level correctness does not imply scientific validity risks being partly definitional rather than an empirical demonstration of LLM shortcomings in coupled regimes.

Authors: We agree that explicit documentation is required. In the revised manuscript we will add a new subsection in Methods that details the construction of the task-critical correctness rules. These rules were assembled from the official VASP manual, peer-reviewed literature on DFT+U and magnetism, and independent input from two computational materials scientists; they were finalized before the policy-accuracy checklist was written and address physical consistency constraints (e.g., simultaneous satisfaction of ISPIN, MAGMOM, and LDAU parameters) that are orthogonal to the syntactic and formatting rules used for policy accuracy. This addition will make the empirical nature of the observed gap explicit.

revision: yes

Referee: [Results] Results (failure analysis): The statement that errors concentrate in DFT+U, magnetism, and correlated materials lacks quantitative support such as the fraction of test cases involving these regimes, the per-regime error rates, or concrete examples of coupled constraints that were violated. This detail is load-bearing for the claim that failures are regime-specific rather than uniformly distributed.

Authors: We accept this criticism and will expand the failure-analysis section. The revision will report: (i) the exact fraction of the 1,200 test cases that involve DFT+U, magnetism, or correlated-electron settings; (ii) task-critical correctness error rates broken down by regime; and (iii) two to three concrete examples of simultaneously violated coupled constraints (e.g., incorrect MAGMOM sign together with missing LDAUL for a transition-metal oxide). These additions will supply the requested quantitative grounding.

revision: yes

Referee: [Evaluation] Evaluation protocol: No mention is made of external grounding for task-critical correctness (e.g., expert physicist review of a sample of outputs or execution of generated INCAR files in VASP to check for runtime or convergence issues). This absence weakens the assertion that the metric captures scientific validity beyond internal checklist compliance.

Authors: We acknowledge that the current benchmark relies on rule-based internal validation rather than runtime VASP execution or post-hoc expert review of every output. Performing full DFT runs for thousands of generated INCAR files would have been computationally prohibitive within the scope of this study. In the revised manuscript we will add an explicit limitations paragraph stating this design choice and outlining how future work could incorporate sampled VASP executions or expert adjudication. We therefore treat the requested external grounding as a planned extension rather than a change to the present results.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark evaluation

This paper introduces INCARBench as an empirical benchmark for LLM performance on VASP INCAR configuration generation and repair tasks. It reports model accuracies on defined metrics (semantic accuracy, policy accuracy, task-critical correctness) and analyzes failure modes in physically coupled settings. No derivations, equations, fitted parameters, or self-citation chains are present that could reduce any claim to its inputs by construction. The evaluation relies on external model outputs and task definitions without internal reductions or load-bearing self-references. This is a standard self-contained benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical derivations, fitted parameters, or new physical entities are described.

pith-pipeline@v0.9.1-grok · 5718 in / 1082 out tokens · 16342 ms · 2026-06-26T07:06:17.503159+00:00 · methodology

Reference graph

Works this paper leans on

40 extracted references · 15 canonical work pages

[1] Ge Lei, Ronan Docherty, and Samuel J. Cooper. Materials science in the era of large language models: a perspective. Digital Discovery, 3:1257–1272, 2024. doi: 10.1039/D4DD00074A

[2] Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624:570–578, 2023. doi: 10.1038/s41586-023-06792-0

[3] Jaewoong Choi and Byungju Lee. Accelerating materials language processing with large language models. Communications Materials, 5:13, 2024. doi: 10.1038/s43246-024-00449-9

[4] Andre Niyongabo Rubungo, Craig Arnold, Barry P. Rand, and Adji Bousso Dieng. Llm-prop: predicting the properties of crystalline materials using large language models. npj Computational Materials, 11:186, 2025. doi: 10.1038/s41524-025-01536-2

[5] Huan Zhang, Yu Song, Ziyu Hou, Santiago Miret, and Bang Liu. HoneyComb: A flexible LLM-based agent system for materials science. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3369–3382, 2024. doi: 10.18653/v1/2024.findings-emnlp.192

[6] Mingyu Guo et al. An agentic framework for autonomous materials computation. arXiv preprint arXiv:2512.19458, 2025

[7] Zijian Chen et al. VASPilot: MCP-facilitated multi-agent intelligence for autonomous VASP. arXiv preprint arXiv:2508.07035, 2025

[8] G. Kresse and D. Joubert. From ultrasoft pseudopotentials to the projector augmented-wave method. Physical Review B, 59(3):1758–1775, 1999. doi: 10.1103/PhysRevB.59.1758

[9] Georg Kresse and Jürgen Furthmüller. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational Materials Science, 6(1):15–50, 1996. doi: 10.1016/0927-0256(96)00008-0

[10] Georg Kresse and Jürgen Furthmüller. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Physical Review B, 54(16):11169–11186, 1996. doi: 10.1103/PhysRevB.54.11169

[11] VASP Development Team. Incar – vasp wiki, 2025. URL https://www.vasp.at/wiki/index.php/INCAR. Accessed: 2026-06-02.

[12] D. Hobbs, G. Kresse, and J. Hafner. Fully unconstrained noncollinear magnetism within the projector augmented-wave method. Physical Review B, 62(17):11556–11570, 2000. doi: 10.1103/PhysRevB.62.11556

[13] S. L. Dudarev, G. A. Botton, S. Y. Savrasov, C. J. Humphreys, and A. P. Sutton. Electron-energy-loss spectra and the structural stability of nickel oxide: An LSDA+U study. Physical Review B, 57(3):1505–1509, 1998. doi: 10.1103/PhysRevB.57.1505

[14] A. I. Liechtenstein, V. I. Anisimov, and J. Zaanen. Density-functional theory and strong interactions: Orbital ordering in mott-hubbard insulators. Physical Review B, 52(8):R5467–R5470, 1995. doi: 10.1103/PhysRevB.52.R5467

[15] Stefan Grimme, Jens Antony, Stephan Ehrlich, and Helge Krieg. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H–Pu. The Journal of Chemical Physics, 132(15):154104, 2010. doi: 10.1063/1.3382344

[16] M. Dion, H. Rydberg, E. Schröder, D. C. Langreth, and B. I. Lundqvist. Van der Waals density functional for general geometries. Physical Review Letters, 92(24):246401, 2004. doi: 10.1103/PhysRevLett.92.246401

[17] OpenAI. Openai gpt-5 system card, 2025. URL https://arxiv.org/abs/2601.03267

[18] Google DeepMind. Gemini 2.5 pro. Google AI for Developers model documentation, 2025. URL https://ai.google.dev/gemini-api/docs/models#gemini-2.5-pro

[19] Anthropic. Introducing claude 4, 2025. URL https://www.anthropic.com/news/claude-4

[20] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in large language models via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[21] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[22] MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

[23] Kimi Team and Moonshot AI. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

[24] GLM-5 Team, Jie Tang, et al. Glm-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

[25] Junkai Zhang et al. MatSciBench: Benchmarking the reasoning ability of large language models in materials science. arXiv preprint arXiv:2510.12171, 2025

[26] Jerry Junyang Cheung, Shiyao Shen, Yuchen Zhuang, Yinghao Li, Rampi Ramprasad, and Chao Zhang. Msqa: Benchmarking llms on graduate-level materials science reasoning and knowledge. arXiv preprint arXiv:2505.23982, 2025

[27] Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, and Adji Bousso Dieng. Llm4mat-bench: Benchmarking large language models for materials property prediction. arXiv preprint arXiv:2411.00177, 2024

[28] Siyu Liu et al. Mattools: Benchmarking large language models for materials science tools. arXiv preprint arXiv:2505.10852, 2025

[29] Anyang Peng, Chun Cai, Mingyu Guo, Duo Zhang, Chengqian Zhang, Wanrun Jiang, Yinan Wang, Antoine Loew, Chengkun Wu, Weinan E, Linfeng Zhang, and Han Wang. LAMBench: a benchmark for large atomistic models. npj Computational Materials, 12:62. doi: 10.1038/s41524-025-01929-3

[31] Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013. doi: 10.1063/1.4812323

[32] Matthew K. Horton, Patrick Huck, Ruo Xi Yang, Jason M. Munro, Shyam Dwaraknath, Alex M. Ganose, Ryan S. Kingsbury, Mingjian Wen, Jimmy X. Shen, Tyler S. Mathis, Aaron D. Kaplan, Karlo Berket, Janosh Riebesell, Janine George, Andrew S. Rosen, Evan W. C. Spotte-Smith, Matthew J. McDermott, Orion A. Cohen, Alex Dunn, Matthew C. Kuner, Gian-Marco Rignanese, G... 2025. doi: 10.1038/s41563-025-02272-0

[33] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science, 68:314–319, 2013. doi: 10.1016/j.commatsci.2012.10.028

[34] Anthropic. Claude models overview. Claude API documentation, 2026. URL https://platform.claude.com/docs/en/about-claude/models/overview

[35] Anthropic. Introducing claude haiku 4.5, 2025. URL https://www.anthropic.com/news/claude-haiku-4-5

[36] Google DeepMind. Gemini 3.1 pro. Google AI for Developers model documentation, 2026. URL https://ai.google.dev/gemini-api/docs/models#gemini-3.1-pro

[37] DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[38] Qwen Team. Qwen3-coder: Agentic coding in the world, 2025. URL https://qwenlm.github.io/blog/qwen3-coder/

[39] MiniMax. The minimax-m2 series: Mini activations unleashing max real-world intelligence. arXiv preprint arXiv:2605.26494, 2026

[40] Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[1] [1]

Ge Lei, Ronan Docherty, and Samuel J. Cooper. Materials science in the era of large language models: a perspective.Digital Discovery, 3:1257–1272, 2024. doi: 10.1039/ D4DD00074A

2024

[2] [2]

Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes

Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chem- ical research with large language models.Nature, 624:570–578, 2023. doi: 10.1038/ s41586-023-06792-0

2023

[3] [3]

Accelerating materials language processing with large language models.Communications Materials, 5:13, 2024

Jaewoong Choi and Byungju Lee. Accelerating materials language processing with large language models.Communications Materials, 5:13, 2024. doi: 10.1038/s43246-024-00449-9

work page doi:10.1038/s43246-024-00449-9 2024

[4] [4]

Rand, and Adji Bousso Dieng

Andre Niyongabo Rubungo, Craig Arnold, Barry P. Rand, and Adji Bousso Dieng. Llm- prop: predicting the properties of crystalline materials using large language models.npj Computational Materials, 11:186, 2025. doi: 10.1038/s41524-025-01536-2

work page doi:10.1038/s41524-025-01536-2 2025

[5] [5]

Nguyen, See-Kiong Ng, and Anh Tuan Luu

Huan Zhang, Yu Song, Ziyu Hou, Santiago Miret, and Bang Liu. HoneyComb: A flex- ible LLM-based agent system for materials science. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 3369–3382, 2024. doi: 10.18653/v1/2024. findings-emnlp.192

work page doi:10.18653/v1/2024 2024

[6] [6]

An agentic framework for autonomous materials computation.arXiv preprint arXiv:2512.19458, 2025

Mingyu Guo et al. An agentic framework for autonomous materials computation.arXiv preprint arXiv:2512.19458, 2025

arXiv 2025

[7] [7]

VASPilot: MCP-facilitated multi-agent intelligence for autonomous VASP.arXiv preprint arXiv:2508.07035, 2025

Zijian Chen et al. VASPilot: MCP-facilitated multi-agent intelligence for autonomous VASP.arXiv preprint arXiv:2508.07035, 2025

arXiv 2025

[8] [8]

Kresse and D

G. Kresse and D. Joubert. From ultrasoft pseudopotentials to the projector augmented- wave method.Physical Review B, 59(3):1758–1775, 1999. doi: 10.1103/PhysRevB.59.1758

work page doi:10.1103/physrevb.59.1758 1999

[9] [9]

Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set

Georg Kresse and Jürgen Furthmüller. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set.Computational Materials Science, 6(1):15–50, 1996. doi: 10.1016/0927-0256(96)00008-0

work page doi:10.1016/0927-0256(96)00008-0 1996

[10] [10]

Kresse and J

Georg Kresse and Jürgen Furthmüller. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set.Physical Review B, 54(16):11169–11186, 1996. doi: 10.1103/PhysRevB.54.11169

work page doi:10.1103/physrevb.54.11169 1996

[11] [11]

Incar – vasp wiki, 2025

VASP Development Team. Incar – vasp wiki, 2025. URLhttps://www.vasp.at/wiki/ index.php/INCAR. Accessed: 2026-06-02. 11

2025

[12] [12]

Hobbs, G

D. Hobbs, G. Kresse, and J. Hafner. Fully unconstrained noncollinear magnetism within the projector augmented-wave method.Physical Review B, 62(17):11556–11570, 2000. doi: 10.1103/PhysRevB.62.11556

work page doi:10.1103/physrevb.62.11556 2000

[13] [13]

S. L. Dudarev, G. A. Botton, S. Y. Savrasov, C. J. Humphreys, and A. P. Sutton. Electron- energy-lossspectraandthestructuralstabilityofnickeloxide: AnLSDA+Ustudy.Physical Review B, 57(3):1505–1509, 1998. doi: 10.1103/PhysRevB.57.1505

work page doi:10.1103/physrevb.57.1505 1998

[14] [14]

A. I. Liechtenstein, V. I. Anisimov, and J. Zaanen. Density-functional theory and strong interactions: Orbital ordering in mott-hubbard insulators.Physical Review B, 52(8):R5467– R5470, 1995. doi: 10.1103/PhysRevB.52.R5467

work page doi:10.1103/physrevb.52.r5467 1995

[15] [15]

Stefan Grimme, Jens Antony, Stephan Ehrlich, and Helge Krieg. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H–Pu.The Journal of Chemical Physics, 132(15):154104, 2010. doi: 10.1063/1. 3382344

work page doi:10.1063/1 2010

[16] [16]

M. Dion, H. Rydberg, E. Schröder, D. C. Langreth, and B. I. Lundqvist. Van der Waals density functional for general geometries.Physical Review Letters, 92(24):246401, 2004. doi: 10.1103/PhysRevLett.92.246401

work page doi:10.1103/physrevlett.92.246401 2004

[17] [17]

Openai gpt-5 system card, 2025

OpenAI. Openai gpt-5 system card, 2025. URLhttps://arxiv.org/abs/2601.03267

Pith/arXiv arXiv 2025

[18] [18]

Gemini 2.5 pro

Google DeepMind. Gemini 2.5 pro. Google AI for Developers model documentation, 2025. URLhttps://ai.google.dev/gemini-api/docs/models#gemini-2.5-pro

2025

[19] [19]

Introducing claude 4, 2025

Anthropic. Introducing claude 4, 2025. URLhttps://www.anthropic.com/news/ claude-4

2025

[20] [20]

Deepseek-r1: Incentivizing reasoning capability in large language models via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in large language models via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[21] [21]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[22] [22]

Minimax-m1: Scaling test-time compute efficiently with lightning attention

MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

Pith/arXiv arXiv 2025

[23] [23]

Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

Kimi Team and Moonshot AI. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

Pith/arXiv arXiv 2026

[24] [24]

Glm-5: From vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

GLM-5 Team, Jie Tang, et al. Glm-5: From vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

Pith/arXiv arXiv 2026

[25] Junkai Zhang et al. MatSciBench: Benchmarking the reasoning ability of large language models in materials science. arXiv preprint arXiv:2510.12171, 2025

[26] Jerry Junyang Cheung, Shiyao Shen, Yuchen Zhuang, Yinghao Li, Rampi Ramprasad, and Chao Zhang. Msqa: Benchmarking llms on graduate-level materials science reasoning and knowledge. arXiv preprint arXiv:2505.23982, 2025

[27] Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, and Adji Bousso Dieng. Llm4mat-bench: Benchmarking large language models for materials property prediction. arXiv preprint arXiv:2411.00177, 2024

[28] Siyu Liu et al. Mattools: Benchmarking large language models for materials science tools. arXiv preprint arXiv:2505.10852, 2025

[29] Anyang Peng, Chun Cai, Mingyu Guo, Duo Zhang, Chengqian Zhang, Wanrun Jiang, Yinan Wang, Antoine Loew, Chengkun Wu, Weinan E, Linfeng Zhang, and Han Wang. LAMBench: a benchmark for large atomistic models. npj Computational Materials, 12:62, doi: 10.1038/s41524-025-01929-3

[31] Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013. doi: 10.1063/1.4812323

[32] Matthew K. Horton, Patrick Huck, Ruo Xi Yang, Jason M. Munro, Shyam Dwaraknath, Alex M. Ganose, Ryan S. Kingsbury, Mingjian Wen, Jimmy X. Shen, Tyler S. Mathis, Aaron D. Kaplan, Karlo Berket, Janosh Riebesell, Janine George, Andrew S. Rosen, Evan W. C. Spotte-Smith, Matthew J. McDermott, Orion A. Cohen, Alex Dunn, Matthew C. Kuner, Gian-Marco Rignanese, G... 2025. doi: 10.1038/s41563-025-02272-0

[33] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science, 68:314–319, 2013. doi: 10.1016/j.commatsci.2012.10.028

[34] Anthropic. Claude models overview. Claude API documentation, 2026. URL https://platform.claude.com/docs/en/about-claude/models/overview

[35] Anthropic. Introducing claude haiku 4.5, 2025. URL https://www.anthropic.com/news/claude-haiku-4-5

[36] Google DeepMind. Gemini 3.1 pro. Google AI for Developers model documentation, 2026. URL https://ai.google.dev/gemini-api/docs/models#gemini-3.1-pro

[37] DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[38] Qwen Team. Qwen3-coder: Agentic coding in the world, 2025. URL https://qwenlm.github.io/blog/qwen3-coder/

[39] MiniMax. The minimax-m2 series: Mini activations unleashing max real-world intelligence. arXiv preprint arXiv:2605.26494, 2026

[40] Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/