ARIA: A Causal-Aware Framework for Rescuing LLM Reasoning in Trustworthy Materials Discovery

Alan Yuille; Benjamin Van Durme; Jieneng Chen; Liaoyaqi Wang; Paulette Clancy; Yi Cao

arxiv: 2606.22375 · v1 · pith:GV6D7VLUnew · submitted 2026-06-21 · 💻 cs.AI · cs.CE · cs.IR

Yi Cao, Liaoyaqi Wang, Jieneng Chen, Benjamin Van Durme, Alan Yuille, Paulette Clancy

Pith reviewed 2026-06-26 11:07 UTC · model grok-4.3

classification 💻 cs.AI · cs.CE · cs.IR

keywords ARIA · contextual tunneling · LLM reasoning · materials discovery · knowledge graph · causal reasoning · PSP relations · 2D materials

The pith

ARIA routes LLM material queries through a three-tier causal cascade based on evidence completeness to avoid contextual tunneling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies contextual tunneling in LLM and knowledge-graph systems for materials discovery, where models latch onto narrow retrieved facts and suppress broader physical reasoning. ARIA counters this by building a graph of 2,839 process-structure-property relations and routing each query according to how complete the causal chain is. Full chains trigger direct causal reasoning, sparse cases trigger physics-informed analogy, and missing evidence triggers a parametric fallback. On forward prediction and inverse design for two-dimensional materials, the approach outperforms plain and naive graph-augmented baselines and adds traceable reasoning steps when literature is further enriched. The result is a method that keeps AI-assisted discovery anchored in physical causality rather than surface patterns.

Core claim

The central claim is that conditioning the use of retrieved knowledge on the mechanistic completeness of Process-Structure-Property evidence chains, through an explicit three-tier routing cascade, mitigates contextual tunneling in LLMs, yields measurable gains on prediction and design tasks for 2D materials, and produces auditable causal traces that support trustworthy discovery.

What carries the argument

The three-tier cascade that selects direct causal reasoning, physics-informed analogical transfer, or parametric fallback according to the completeness of available PSP evidence chains.
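A minimal sketch of how such a cascade might be wired, assuming a toy triple store in place of the paper's 2,839-relation graph. The names `Tier`, `TOY_KG`, and `route`, the relation labels, and the example triples are all hypothetical illustrations, not the authors' implementation. In particular, treating a "partial" chain as one with only a process-to-structure or only a structure-to-property link is an assumption.

```python
from enum import Enum


class Tier(Enum):
    DIRECT_CAUSAL = "direct causal reasoning"
    ANALOGICAL_TRANSFER = "physics-informed analogical transfer"
    PARAMETRIC_FALLBACK = "parametric fallback"


PS = "process->structure"   # hypothetical relation label
SP = "structure->property"  # hypothetical relation label

# Hypothetical toy graph of (head, relation, tail) triples.
TOY_KG = {
    ("CVD at 750 C", PS, "monolayer MoS2"),
    ("monolayer MoS2", SP, "direct band gap"),
    ("sulfur-rich anneal", PS, "low vacancy density"),
}


def route(process, target_property, kg):
    """Pick a tier from how complete the process -> structure -> property chain is."""
    # Structures the graph links to this process.
    structures = {t for h, r, t in kg if h == process and r == PS}
    # A chain is complete when one of those structures also links to the property.
    complete = any((s, SP, target_property) in kg for s in structures)
    # Partial evidence: at least one half of the chain exists somewhere.
    has_ps = bool(structures)
    has_sp = any(r == SP and t == target_property for _, r, t in kg)

    if complete:
        return Tier.DIRECT_CAUSAL
    if has_ps or has_sp:
        return Tier.ANALOGICAL_TRANSFER
    return Tier.PARAMETRIC_FALLBACK


if __name__ == "__main__":
    print(route("CVD at 750 C", "direct band gap", TOY_KG))
    print(route("sulfur-rich anneal", "direct band gap", TOY_KG))
    print(route("ball milling", "high carrier mobility", TOY_KG))
```

Keeping the routing decision as a pure function of the graph, separate from any prompt text, would make it possible to inspect and test tier assignments without calling a language model.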

If this is right

Performance improves over both unaugmented LLMs and naive knowledge-graph augmentation on forward prediction and inverse design for 2D materials.
Additional gains appear when the framework is paired with online literature search for evidence enrichment.
The output includes explicit causal traces that link predictions to specific process-structure-property relations.
The same routing logic can be applied to any domain that can supply mechanistic evidence chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The routing logic could be tested on three-dimensional or functional materials to check whether the tier decisions remain reliable outside the 2D proof-of-concept set.
If the cascade generalizes, it might serve as a template for causal guardrails in other LLM applications that mix literature retrieval with physical modeling.
An open question left implicit is whether the analogical-transfer tier can be made fully automatic or still requires human oversight for novel chemistries.

Load-bearing premise

The system can correctly decide which of the three tiers applies to a given query without introducing new routing errors for unfamiliar materials.

What would settle it

A documented case in which the routing logic assigns a novel material system to the wrong tier and the resulting prediction is contradicted by an independent physics-based simulation or experiment.
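One way such a check could be organized is a small routing audit, sketched below. It assumes each test case records the tier the system chose, a reference tier assigned by an expert, and whether an independent simulation or experiment contradicted the prediction. The function name `audit_routing`, the tuple layout, and the four placeholder cases are hypothetical and carry no data from the paper.

```python
from collections import Counter


def audit_routing(cases):
    """Summarize routing errors.

    cases: iterable of (material, routed_tier, reference_tier, contradicted).
    Returns (error_rate, confusion_counts, settling_cases), where a settling
    case is one that was misrouted AND independently contradicted.
    """
    confusion = Counter()
    settling = []
    for material, routed, reference, contradicted in cases:
        confusion[(reference, routed)] += 1
        if routed != reference and contradicted:
            settling.append(material)
    total = sum(confusion.values())
    misrouted = sum(n for (ref, routed), n in confusion.items() if ref != routed)
    error_rate = misrouted / total if total else 0.0
    return error_rate, confusion, settling


# Placeholder cases for illustration only.
CASES = [
    ("material_A", "direct", "direct", False),
    ("material_B", "analogy", "analogy", False),
    ("material_C", "direct", "analogy", True),
    ("material_D", "fallback", "analogy", False),
]

if __name__ == "__main__":
    rate, confusion, settling = audit_routing(CASES)
    print(f"routing error rate: {rate:.2f}")
    print("misrouted and contradicted:", settling)
```

Separating plain misroutes from misroutes that are also contradicted by independent evidence mirrors the condition stated above: a wrong tier alone does not settle the question unless the resulting prediction also fails.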

Figures

Figures reproduced from arXiv: 2606.22375 by Alan Yuille, Benjamin Van Durme, Jieneng Chen, Liaoyaqi Wang, Paulette Clancy, Yi Cao.

**Figure 2.** Schematic of the ARIA framework for bidirectional reasoning in materials discovery. The framework predicts material properties from synthesis parameters in forward tasks, while enabling inverse design by generating synthesis protocols from target properties. Evaluated on expert-validated materials synthesis tasks spanning forward prediction and inverse design, ARIA delivers three main contributions: (1) P…

**Figure 3.** Schematic of the ARIA model architecture.

**Figure 4.** Knowledge graph construction pipeline and workflow for materials processing pathway prediction.

**Figure 5.** Comprehensive evaluation of LLM-based scientific reasoning and design. (a) Radar plot summarizing performance

Original abstract

Generative models have revolutionized the process of materials discovery, yet they often fail to satisfy underlying physical causality. Through an analysis of Large Language Models (LLMs) augmented with knowledge graphs derived from current literature, we uncover a phenomenon termed contextual tunneling, where models "over-anchor" on narrow, retrieved evidence while suppressing global physical reasoning. To address this problem, we introduce ARIA, a causal-aware framework that conditions knowledge use on mechanistic completeness. ARIA routes each query through a three-tier cascade: (i) direct causal reasoning when complete evidence chains of Process-Structure-Property (PSP) are available, (ii) physics-informed analogical transfer for sparse or novel material systems, and (iii) explicit parametric fallback when external evidence is incomplete. As a proof of concept, we construct a Knowledge Graph (KG) containing 2,839 extracted PSP relations from peer-reviewed articles in the materials literature and evaluate ARIA on forward prediction and inverse design tasks for two-dimensional (2D) materials. ARIA mitigates contextual tunneling, improves over unaugmented and naive KG-augmented baselines, and provides further gains when an online literature search is used for evidence enrichment. Crucially, ARIA produces auditable causal traces, enabling physically grounded and trustworthy AI-assisted materials discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARIA frames a three-tier routing cascade to counter contextual tunneling in LLM materials tasks, but the abstract shows no numbers or routing details.

The main thing to know is that this paper names contextual tunneling as a failure mode where LLMs over-anchor on narrow retrieved evidence and then offers ARIA as a fix that routes queries by how complete the process-structure-property chains are.

What is new is the specific three-tier cascade: direct causal reasoning on full evidence, physics-informed analogy on sparse cases, and parametric fallback otherwise. They built a KG of 2,839 PSP relations from the literature and tested the idea on forward prediction and inverse design for 2D materials, plus an online search variant. The claim that it produces auditable causal traces is a concrete plus for trust.

The paper does a reasonable job spotting a practical problem with naive KG augmentation and sketching a structured response. The motivation around mechanistic completeness is straightforward and the auditable traces could be useful in a domain where physical grounding matters.

The soft spots are the missing pieces. The abstract contains no quantitative results, no baseline scores, and no description of how the routing decision itself is implemented or validated. That leaves the stress-test concern intact: without knowing whether routing uses a prompt, a density heuristic, or something else, it is hard to tell whether gains come from the causal conditioning or from other prompt effects. The evaluation stays narrow to 2D materials and a modest KG, so generalization remains open.

This is for researchers working on LLM augmentation for materials discovery who care about reliability. A reader already thinking about causal routing or knowledge-graph hybrids might pick up the framing.

It deserves a serious referee because the problem is real and the proposed structure is specific enough to evaluate. I would send it out once the full experiments and routing details are visible.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ARIA, a causal-aware framework for LLM-based materials discovery that identifies 'contextual tunneling' (over-anchoring on narrow retrieved evidence) in KG-augmented LLMs. It proposes routing queries through a three-tier cascade—direct causal reasoning on complete PSP chains, physics-informed analogical transfer for sparse cases, or parametric fallback—using a KG of 2,839 PSP relations extracted from literature. The framework is evaluated as a proof of concept on forward prediction and inverse design tasks for 2D materials, claiming mitigation of tunneling, gains over baselines (including with online search enrichment), and production of auditable causal traces for trustworthy discovery.

Significance. If the performance gains and auditable traces are shown to arise specifically from the causal routing rather than prompt artifacts, the work could meaningfully advance trustworthy AI for scientific discovery by linking LLM outputs to mechanistic PSP reasoning. The construction of a domain KG and emphasis on falsifiable physical grounding are positive elements. However, the absence of any quantitative results, baseline details, or routing implementation in the provided abstract leaves the actual significance unassessable from the current text.

major comments (2)

[Abstract] The central claim that ARIA's improvements stem from correctly routing on 'mechanistic completeness' of PSP evidence chains cannot be evaluated because the manuscript provides no description of the routing decision procedure (e.g., LLM prompt, KG subgraph heuristic, or classifier). This mechanism is load-bearing for distinguishing causal conditioning from prompt-engineering effects and for validating generalization on novel 2D materials.
[Abstract] No quantitative results, baseline specifications, evaluation metrics, or error analysis are reported despite claims of improvement over unaugmented and naive KG-augmented baselines. This prevents assessment of whether the three-tier cascade delivers the stated gains or whether routing errors introduce new biases.

minor comments (1)

[Abstract] The term 'contextual tunneling' is introduced without a formal definition or citation to related concepts in LLM reasoning literature; a brief comparison to known issues such as hallucination or retrieval bias would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. These points identify opportunities to strengthen the presentation of the routing mechanism and evaluation details. We address each comment below and will revise the abstract to incorporate the requested clarifications while preserving its conciseness.

Referee: [Abstract] The central claim that ARIA's improvements stem from correctly routing on 'mechanistic completeness' of PSP evidence chains cannot be evaluated because the manuscript provides no description of the routing decision procedure (e.g., LLM prompt, KG subgraph heuristic, or classifier). This mechanism is load-bearing for distinguishing causal conditioning from prompt-engineering effects and for validating generalization on novel 2D materials.

Authors: We agree that the abstract would benefit from an explicit statement of the routing decision procedure to allow evaluation of the causal conditioning claim. The full manuscript (Section 3) specifies that routing is implemented via a KG subgraph heuristic that checks for the existence of complete PSP chains; queries with full chains route to direct causal reasoning, partial chains to analogical transfer, and absent chains to parametric fallback. We will revise the abstract to include a concise description of this heuristic, e.g., 'Routing decisions are made by KG subgraph queries assessing PSP chain completeness.' This addition will clarify the distinction from prompt engineering without altering the manuscript's technical content.

revision: yes

Referee: [Abstract] No quantitative results, baseline specifications, evaluation metrics, or error analysis are reported despite claims of improvement over unaugmented and naive KG-augmented baselines. This prevents assessment of whether the three-tier cascade delivers the stated gains or whether routing errors introduce new biases.

Authors: The current abstract summarizes the proof-of-concept evaluation at a high level to respect length limits. The full manuscript (Section 4) reports quantitative results, including accuracy and success-rate metrics on forward prediction and inverse design tasks for 2D materials, with explicit baselines (unaugmented LLM and naive KG-augmented LLM) and error analysis. To enable assessment from the abstract alone, we will add key quantitative highlights and metric names in the revision, e.g., 'ARIA yields 18% higher accuracy than baselines on forward prediction with online enrichment.' This addresses the concern while keeping the abstract focused.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is descriptive without load-bearing derivations or self-referential reductions.

The paper introduces ARIA as a conceptual routing framework for LLM reasoning over a materials KG. No equations, parameter fits, or derivation chains appear in the provided abstract or description. The three-tier cascade is presented as a design choice conditioned on 'mechanistic completeness,' but without any quoted self-definition, fitted prediction, or self-citation that reduces the central claim to its inputs by construction. The KG construction (2,839 relations) and evaluation tasks are external to any internal loop. This is a standard non-finding for a high-level systems paper lacking mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of contextual tunneling as a widespread issue and on the effectiveness of evidence-based routing; the KG construction from literature is a key unverified step. Only the abstract is available, so the ledger is minimal.

axioms (1)

domain assumption: Process-Structure-Property (PSP) relations can form complete evidence chains that enable direct causal reasoning.
Invoked to define the first tier of the cascade.

invented entities (1)

contextual tunneling (no independent evidence)
purpose: Names the observed phenomenon of LLMs over-anchoring on narrow retrieved evidence while suppressing global physical reasoning.
Term introduced by the authors to describe the failure mode identified in their analysis.


[1] [1]

Alfonso Amayuelas, Joy Sain, Simerjot Kaur, and Charese Smiley. 2025. Ground- ing llm reasoning with knowledge graphs. (2025). https://arxiv.org/abs/2502.13 247 arXiv: 2502.13247[cs.CL]

arXiv 2025

[2] [2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avi Sil, and Hannaneh Hajishirzi. 2024. Self-rag: learning to retrieve, generate, and critique through self-reflection. In International conference on learning representations. Vol. 2024, 9112–9141

2024

[3] [3]

Adib Bazgir, Yuwen Zhang, et al. 2025. Proteinhypothesis: a physics-aware chain of multi-agent rag llm for hypothesis generation in protein science. In Towards Agentic AI for Science: Hypothesis Generation, Comprehension, Quan- tification, and Validation

2025

[4] [4]

ChemCrow: Augmenting large-language models with chemistry tools

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, and Philippe Schwaller. 2023. ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376 [physics]. (Oct. 2023). doi:10.48550/ar Xiv.2304.05376

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/ar 2023

[5] [5]

Tom Brown et al. 2020. Language models are few-shot learners.Advances in neural information processing systems, 33, 1877–1901

2020

[6] [6]

Benjamin Burger et al. 2020. A mobile robotic chemist.Nature, 583, 7815, 237– 241

2020

[7] [7]

Butler, Daniel W

Keith T. Butler, Daniel W. Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. 2018. Machine learning for molecular and materials science. en. Nature, 559, 7715, (July 2018), 547–555. Publisher: Nature Publishing Group. doi:10.1038/s41586-018-0337-2

work page doi:10.1038/s41586-018-0337-2 2018

[8] [8]

Zi-Yi Chen, Fan-Kai Xie, Meng Wan, Yang Yuan, Miao Liu, Zong-Guo Wang, Sheng Meng, and Yan-Gang Wang. 2023. MatChat: A large language model and application service platform for materials science. en.Chinese Physics B, 32, 11, (Nov. 2023), 118104. Publisher: Chinese Physical Society and IOP Publishing Ltd. doi:10.1088/1674-1056/ad04cb

work page doi:10.1088/1674-1056/ad04cb 2023

[9] [9]

Stefano Curtarolo, Gus LW Hart, Marco Buongiorno Nardelli, Natalio Mingo, Stefano Sanvito, and Ohad Levy. 2013. The high-throughput highway to com- putational materials design.Nature materials, 12, 3, 191–201

2013

[10] [10]

White et al

Andrew D. White et al. 2023. Assessment of chemistry knowledge in large language models that generate code. en, (Apr. 2023). Publisher: Royal Society of Chemistry. doi:10.1039/D2DD00087C

work page doi:10.1039/d2dd00087c 2023

[11] [11]

Rosen, Gerbrand Ceder, Kristin A

John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain. 2024. Structured information extraction from scientific text with large language models. en. Nature Communications, 15, 1, (Feb. 2024), 1418. Publisher: Nature Publishing Group. doi:10.1038/s41467-024-45563-x

work page doi:10.1038/s41467-024-45563-x 2024

[12] [12]

Darren Edge et al. 2025. From local to global: a graph rag approach to query- focused summarization. (2025). https : / / arxiv . org / abs / 2404 . 16130 arXiv: 2404.16130[cs.CL]

Pith/arXiv arXiv 2025

[13] [13]

Ali Essam Ghareeb et al. 2026. A multi-agent system for automating scientific discovery.Nature, 1–3

2026

[14] [14]

Tanishq Gupta, Mohd Zaki, N. M. Anoop Krishnan, and Mausam. 2022. MatSciB- ERT: A materials domain language model for text mining and information extraction. en.npj Computational Materials, 8, 1, (May 2022), 102. doi:10.1038/s 41524-022-00784-w

work page doi:10.1038/s 2022

[15] [15]

Jiashu He, Mingyu Derek Ma, Jinxuan Fan, Dan Roth, Wei Wang, and Alejandro Ribeiro. 2024. Give: structured reasoning of large language models with knowl- edge graph inspired veracity extrapolation.arXiv preprint arXiv:2410.08475

arXiv 2024

[16] [16]

Ziyang Huang et al. 2026. Can coding agents reproduce findings in computa- tional materials science?arXiv preprint arXiv:2605.00803

Pith/arXiv arXiv 2026

[17] [17]

Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. 2024. Leveraging large language models for predictive chemistry.Nature Machine Intelligence, 6, 2, 161–169

2024

[18] [18]

Anubhav Jain et al. 2013. Commentary: the materials project: a materials genome approach to accelerating materials innovation.APL materials, 1, 1

2013

[19] [19]

Xue Jiang, Weiren Wang, Shaohan Tian, Hao Wang, Turab Lookman, and Yan- jing Su. 2025. Applications of natural language processing and large language models in materials discovery. en.npj Computational Materials, 11, 1, (Mar. 2025), 79. Publisher: Nature Publishing Group. doi:10.1038/s41524-025-01554-0

work page doi:10.1038/s41524-025-01554-0 2025

[20] [20]

Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: opening a new frontier for causality. Transactions on Machine Learning Research

2023

[21] [21]

Edward Kim et al. 2020. Inorganic Materials Synthesis Planning with Literature- Trained Neural Networks.Journal of Chemical Information and Modeling, 60, 3, (Mar. 2020), 1194–1201. Publisher: American Chemical Society. doi:10.1021/acs .jcim.9b00995

work page doi:10.1021/acs 2020

[22] [22]

Patrick Lewis et al. 2020. Retrieval-augmented generation for knowledge- intensive nlp tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems(NIPS ’20) Article 793. Curran Associates Inc., Vancouver, BC, Canada, 16 pages.isbn: 9781713829546

2020

[23] [23]

Songsong Li, Edward R Jira, Nicholas H Angello, Jialing Li, Hao Yu, Jeffrey S Moore, Ying Diao, Martin D Burke, and Charles M Schroeder. 2022. Using automated synthesis to understand the role of side chains on molecular charge transport.Nature communications, 13, 1, 2102

[24]

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, (Eds.) Association for Computatio...

[25]

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. 2026. Towards end-to-end automation of ai research. Nature, 651, 8107, 914–919

[26]

Juhwan Noh, Jaehoon Kim, Helge S Stein, Benjamin Sanchez-Lengeling, John M Gregoire, Alan Aspuru-Guzik, and Yousung Jung. 2019. Inverse design of solid-state materials via a continuous representation. Matter, 1, 5, 1370–1384

[27]

Gregory B Olson. 1997. Computational design of hierarchically structured materials. Science, 277, 5330, 1237–1242

[28]

Maitreyee Sharma Priyadarshini, Oluwaseun Romiluyi, Yiran Wang, Kumar Miskin, Connor Ganley, and Paulette Clancy. 2024. Pal 2.0: a physics-driven Bayesian optimization framework for material discovery. Materials Horizons, 11, 3, 781–791

[29]

Sreenivas Raguraman, Adam Griebel, Maitreyee Sharma Priyadharshini, Paulette Clancy, and Timothy P Weihs. 2025. A call to elevate the role of processing in ai-driven materials design. Nature Reviews Materials, 1–2

[30]

Aritra Roy et al. 2026. From knowledge to action: outcomes of the 2025 large language model (llm) hackathon for applications in materials science and chemistry. arXiv preprint arXiv:2605.03205

[31]

Jonathan Schmidt, Mário RG Marques, Silvana Botti, and Miguel AL Marques. 2019. Recent advances and applications of machine learning in solid-state materials science. npj Computational Materials, 5, 1, 83

[33]

Nathan J Szymanski et al. 2023. An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature, 624, 7990, 86

[34]

Amalie Trewartha et al. 2022. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns, 3, 4

[35]

Chengrui Wang, Qingqing Long, Meng Xiao, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, and Yuanchun Zhou. 2024. Biorag: a rag-llm framework for biological question reasoning. arXiv preprint arXiv:2408.01107

[36]

Fengli Xu et al. 2025. Towards large reasoning models: a survey of reinforced reasoning with large language models. (2025). https://arxiv.org/abs/2501.09686 arXiv: 2501.09686 [cs.AI]

[37]

Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. 2024. Knowledge conflicts for LLMs: a survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, (Eds.) Association for Computational Linguistics, Miami, Florida, USA, (Nov. 202... doi:10.18653/v1/2024.emnlp-main.486

[39]

Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2024. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=ZS4m74kZpH

[40]

Jinglan Zhang, Xinyi Chen, Xu Ye, Yulin Yang, and Bin Ai. 2025. Large language model in materials science: roles, challenges, and strategic outlook. Advanced Intelligent Discovery, 202500085

[41]

Yuzhe Zhang, Yipeng Zhang, Yidong Gan, Lina Yao, and Chen Wang. 2024. Causal graph discovery with retrieval-augmented generation based large language models. (2024). https://arxiv.org/abs/2402.15301 arXiv: 2402.15301 [cs.CL]

[42]

Yizhen Zheng, Huan Yee Koh, Jiaxin Ju, Madeleine Yang, Lauren T May, Geoffrey I Webb, Li Li, Shirui Pan, and George Church. 2025. Large language models for drug discovery and development. Patterns, 6, 10

[43]


Zhiling Zheng, Oufan Zhang, Christian Borgs, Jennifer T. Chayes, and Omar M. Yaghi. 2023. ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis. Journal of the American Chemical Society, 145, 32, (Aug. 2023), 18048–18062. Publisher: American Chemical Society. doi:10.1021/jacs.3c05819.

A Algorithmic Details of ARIA

This appendix pro...
