arxiv: 2604.03460 · v2 · submitted 2026-04-03 · ⚛️ physics.chem-ph · physics.comp-ph

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

Andres Felipe Bocanegra Vargas, Branislav K. Nikolić, Federico Garcia-Gaitan, Felipe Reyes-Osorio, Gang Meng, Jalil Varela-Manjarres, Mohammadhasan Dinpajooh, Tao E. Li, Xinwei Ji, Yafei Ren

Pith reviewed 2026-05-13 18:23 UTC · model grok-4.3

classification ⚛️ physics.chem-ph physics.comp-ph

keywords fermilink agent packages scientific simulations research reproduction simulation

The pith

FermiLink separates package knowledge bases from simulation workflows so that one agent can run the same workflows across roughly fifty packages in nine domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FermiLink introduces a unified agent framework for autonomous scientific simulations by isolating package-specific knowledge bases from the core workflows. This separation lets the same workflows, powered by a four-layer progressive disclosure mechanism, operate uniformly across supported packages without repeated custom engineering. Benchmarks across 132 figure-reproduction tasks with 44 packages show 56 percent success, including high-fidelity matches in some cases. A blinded test further shows the framework can generate research-grade output on unpublished polariton problems when given only objectives and source code.

Core claim

The paper claims that separating package knowledge bases from simulation workflows allows a single set of workflows to produce consistent results from figure-level tasks to full research projects on HPC clusters across roughly fifty packages spanning physics to engineering. With OpenAI Codex as the underlying model, the system reproduces 74 of 132 published figures at 56.1 percent overall success, of which 30 match at high fidelity and 35 at qualitative level, and it can deliver usable new results on undocumented problems when supplied with detailed objectives and code.
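
The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative, and the remainder is left unlabeled because the text does not say how reproduced figures outside the two named tiers were classified.

```python
# Counts quoted in the core claim; variable names are illustrative.
total_tasks = 132
reproduced = 74
high_fidelity = 30
qualitative = 35

# Overall reproduction rate as a fraction of all benchmark tasks.
success_rate = reproduced / total_tasks

# Reproduced figures not covered by the two named tiers
# (their classification is not stated in the text).
outside_named_tiers = reproduced - high_fidelity - qualitative

print(f"success rate: {success_rate:.1%}")
print(f"reproduced but outside the two named tiers: {outside_named_tiers}")
```

The two named tiers account for 65 of the 74 reproduced figures, so the rate and the tier counts are consistent with each other but do not fully partition the successes.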

What carries the argument

The four-layer progressive disclosure mechanism that progressively surfaces information from isolated package knowledge bases to support uniform workflow execution across packages.
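
To make the idea concrete, here is a minimal sketch of how a package-agnostic workflow might pull progressively deeper context from an isolated knowledge base. It is not FermiLink's implementation: the four layer names, the `PackageKnowledgeBase` class, `run_workflow`, and the `keyword_check` stand-in for the agent's judgment are all hypothetical, since the text above does not describe what the layers contain.

```python
from dataclasses import dataclass, field

# Hypothetical layer names, shallowest first. The source does not list the
# actual four layers, so these are placeholders for illustration only.
LAYERS = ("summary", "index", "guide", "source")


@dataclass
class PackageKnowledgeBase:
    """Package-specific knowledge, kept separate from workflow code."""
    name: str
    layers: dict[str, str] = field(default_factory=dict)

    def disclose(self, depth: int) -> list[str]:
        """Return the content of the first `depth` layers, shallowest first."""
        return [self.layers[key] for key in LAYERS[:depth] if key in self.layers]


def run_workflow(kb, task, is_enough):
    """Package-agnostic workflow: deepen disclosure until the context suffices.

    `is_enough` stands in for the agent's own judgment (an assumption here).
    """
    for depth in range(1, len(LAYERS) + 1):
        context = kb.disclose(depth)
        if is_enough(task, context):
            return depth, context
    # Fall back to everything the knowledge base holds.
    return len(LAYERS), kb.disclose(len(LAYERS))


def keyword_check(task, context):
    """Toy sufficiency test: the task phrase appears somewhere in the context."""
    return any(task in text for text in context)


demo_kb = PackageKnowledgeBase(
    "demo_pkg",
    {
        "summary": "A toy solver.",
        "index": "Topics: band structure, transport.",
        "guide": "Step-by-step usage notes.",
        "source": "def solve(): ...",
    },
)
depth, context = run_workflow(demo_kb, "band structure", keyword_check)
print(depth, context)
```

The design point is that `run_workflow` never mentions a specific package: supporting another package in this sketch means supplying another `PackageKnowledgeBase`, not editing the workflow.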

If this is right

Workflows can scale directly from single-figure reproduction to full-paper research on HPC clusters without redesign.
The same agent can handle tasks in nine domains with minimal additional setup per package.
Reproduction benchmarks indicate that 56 percent of the 132 benchmarked figures were matched automatically, with 30 reaching high fidelity.
Blind operation on unpublished problems is possible when objectives and source code are supplied.
The approach supplies a reusable infrastructure for moving from scientific questions to computational results across domains.

Pith inferences

The separation principle could apply to agent systems outside scientific simulation if similar knowledge isolation proves effective.
Future extensions might allow rapid addition of new packages with only code access rather than documentation or tutorials.
Success on undocumented polariton cases suggests that source-code inspection alone can substitute for external guidance i…

Load-bearing premise

The separation of package knowledge bases from simulation workflows enables uniform and effective operation across diverse packages via the four-layer progressive disclosure mechanism without requiring substantial per-package custom engineering or frequent expert intervention.

What would settle it

A controlled test adding a new scientific package where the framework requires heavy custom code or repeated expert fixes to reach comparable reproduction rates would show the separation does not deliver the claimed uniformity.
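
One way to make "comparable reproduction rates" concrete is to ask whether a new package's success count is plausible under the reported 74/132 baseline. The sketch below computes an exact one-sided binomial tail using only the standard library. The new-package outcome (3 of 12) is invented for illustration and does not come from the paper, and the check covers only the rate half of the test; the amount of custom code or expert intervention would have to be logged separately.

```python
from math import comb


def binomial_lower_tail(successes: int, trials: int, p: float) -> float:
    """Return P(X <= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes + 1)
    )


# Reported overall reproduction rate from the benchmark.
baseline_rate = 74 / 132

# Hypothetical outcome for a newly added package (not from the paper).
new_successes, new_trials = 3, 12

p_value = binomial_lower_tail(new_successes, new_trials, baseline_rate)
print(
    f"P(at most {new_successes}/{new_trials} successes at the baseline rate) "
    f"= {p_value:.4f}"
)
```

A small tail probability would suggest the new package falls short of the baseline; a large one would not by itself show that the separation principle holds, since it says nothing about how much expert effort was spent.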

Figures

Figures reproduced from arXiv: 2604.03460 by Andres Felipe Bocanegra Vargas, Branislav K. Nikolić, Federico Garcia-Gaitan, Felipe Reyes-Osorio, Gang Meng, Jalil Varela-Manjarres, Mohammadhasan Dinpajooh, Tao E. Li, Xinwei Ji, Yafei Ren.

**Figure 2.** Summary of the 132 figure-level reproduction tasks in SI Table S1. (a) Outcome distribution across nine scientific…

**Figure 3.** Comparison of calculated UP decay rates versus Rabi splitting for single-blinded simulations with the…

Original abstract

Artificial-intelligence (AI) agent frameworks have been developed for autonomous scientific simulations, but most current agent frameworks are tailored to a single or a small set of software packages. Herein, FermiLink, a unified and extensible open-source agent framework is introduced for multidomain scientific simulations. Its key design principle is the separation of package knowledge bases from simulation workflows, so that simulation workflows in FermiLink, from figure-level simulations to full-paper-level research on high-performance computing clusters, operate uniformly among supported packages via a four-layer progressive disclosure mechanism. Using OpenAI Codex as the agent provider, the capabilities of FermiLink are demonstrated across approximately 50 scientific software packages spanning nine research domains from physics to engineering. Systematic benchmarks on 132 real-world figure-level reproduction tasks with 44 packages show that FermiLink reproduces 74 (56.1%) of published figures with simulations, among which 30 achieve high-fidelity agreement and 35 reach qualitative agreement with the target figures. A smaller set of human expert-guided reproduction benchmarks with 10 packages further highlights the importance of expert insights for improving the simulation fidelity. Beyond reproduction, a single-blinded study demonstrates that FermiLink can produce research-grade results on unpublished polariton physics problems when provided with sufficiently detailed research objectives and source code, even in the absence of external documentation or tutorials. Overall, FermiLink provides a scalable research infrastructure that may accelerate the path from scientific questions to computational results across diverse domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FermiLink shows a workable multi-package agent setup with real benchmarks, but its autonomy on new problems hinges on how much expert work goes into each knowledge base.

The main point is that FermiLink separates package knowledge bases from the core workflows and uses a four-layer progressive disclosure mechanism to let the same agent logic run across many different simulation packages. That design is the actual new piece, and it lets them cover roughly 50 packages in nine domains without custom agents for each one. The benchmarks give concrete numbers: 74 out of 132 figure reproductions succeeded, with 30 reaching high fidelity, plus a single-blinded test where it produced usable results on unpublished polariton problems from objectives and source code alone. Releasing the code openly is useful for anyone who wants to try extending it. Those are the parts that hold up on the evidence they present.

The reproduction rate sits at 56 percent, and the paper notes that expert guidance lifts performance, so full hands-off operation is not there yet. The bigger open question is the cost of building and maintaining the knowledge bases for each new package. If that step still needs substantial domain-expert input on APIs, parameter ranges, and usage patterns, then the separation does not deliver the low-engineering scaling the abstract suggests for truly novel problems. The paper does not spell out how the bases for the 50 packages were created, which leaves the scalability claim thinner than the headline numbers imply.

This is worth sending to a serious referee, and it is of interest to groups working on agent frameworks for computational science. The empirical tests and open release give enough substance to review, even though the work would need clearer details on knowledge-base construction and more cases on unseen packages before the autonomy story is solid.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FermiLink, a unified open-source agent framework for multidomain scientific simulations. Its core design separates package knowledge bases from simulation workflows, enabling uniform operation across ~50 packages in nine domains via a four-layer progressive disclosure mechanism. Using OpenAI Codex, it reports systematic benchmarks reproducing 74 of 132 real-world figures (56.1%, with 30 high-fidelity and 35 qualitative), notes improved fidelity with human guidance on 10 packages, and presents a single-blinded study showing research-grade results on unpublished polariton physics problems given detailed objectives and source code without external documentation.

Significance. If the empirical results hold, FermiLink could provide a scalable infrastructure that accelerates the transition from scientific questions to computational outcomes across physics, chemistry, engineering, and related fields by reducing package-specific engineering. The concrete benchmark numbers and blinded study on novel problems supply falsifiable evidence of multidomain capability, though the moderate success rate and documented need for expert input limit the strength of claims for full autonomy.

major comments (2)

[Benchmark Results] Benchmark section (132 figure-level tasks): the 56.1% reproduction rate (74/132, only 30 high-fidelity) together with the explicit statement that human expert guidance improves fidelity already demonstrates that source-code-only operation is error-prone even for known tasks; this directly undermines the claim that the four-layer mechanism delivers uniform, low-intervention performance once knowledge bases are separated.
[Single-blinded Study] Single-blinded polariton study: success is reported only when 'sufficiently detailed research objectives and source code' are supplied; because accurate knowledge bases covering APIs, parameter ranges, and usage patterns must still be constructed per package, the separation principle does not eliminate the need for substantial domain-expert curation, weakening generalization to truly novel problems.

minor comments (2)

[Abstract] Abstract states 'approximately 50' packages while benchmarks use 44; provide an exact count and clarify which packages were used for the blinded study versus the 132-task benchmark.
[Methods] The four-layer progressive disclosure mechanism is introduced as a key innovation but lacks a concise schematic or pseudocode; a small diagram or enumerated steps would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help us clarify the scope and limitations of FermiLink. We provide point-by-point responses below and have revised the manuscript accordingly to strengthen the presentation of our results.

Referee: [Benchmark Results] Benchmark section (132 figure-level tasks): the 56.1% reproduction rate (74/132, only 30 high-fidelity) together with the explicit statement that human expert guidance improves fidelity already demonstrates that source-code-only operation is error-prone even for known tasks; this directly undermines the claim that the four-layer mechanism delivers uniform, low-intervention performance once knowledge bases are separated.

Authors: The 56.1% reproduction rate is presented transparently in the manuscript as the current performance level using OpenAI Codex. The four-layer mechanism enables this level of performance uniformly across 44 packages by progressively disclosing information and handling errors in a standardized way, which would not be feasible without the separation of knowledge bases from workflows. We do not claim error-free or fully autonomous operation; rather, the framework reduces the engineering overhead for multidomain use. The human guidance results are included to show potential for improvement, not to indicate failure of the base system. To address the concern, we will add a dedicated limitations subsection in the revised manuscript discussing the current success rates and the role of expert input.
revision: partial

Referee: [Single-blinded Study] Single-blinded polariton study: success is reported only when 'sufficiently detailed research objectives and source code' are supplied; because accurate knowledge bases covering APIs, parameter ranges, and usage patterns must still be constructed per package, the separation principle does not eliminate the need for substantial domain-expert curation, weakening generalization to truly novel problems.

Authors: In the single-blinded study, the provision of detailed objectives and source code allows the agent to operate on unpublished problems without relying on external documentation, which is the key test of the framework's generalization capability. The construction of knowledge bases is indeed necessary but is decoupled from the simulation workflow, allowing the same workflow code to be applied across domains once the base is built. This separation is what enables scaling to ~50 packages. We acknowledge that expert curation is required for new packages and do not claim zero-effort deployment for arbitrary new software. We will revise the text to explicitly state that knowledge base construction remains a domain-expert task, while emphasizing the uniformity of the operational layer.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarks and blinded study are independent of any derivation chain

full rationale

The manuscript describes a software framework whose central claims rest on reported reproduction rates (74/132 tasks) and a single-blinded polariton study. No equations, uniqueness theorems, fitted parameters, or first-principles derivations appear in the provided text. The four-layer mechanism is presented as an engineering design choice, not a result derived from prior outputs of the same system. Self-citations are absent from the load-bearing sections. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new software architecture without new physical parameters, fitted constants, or entities possessing independent falsifiable evidence outside the framework itself.

axioms (1)

domain assumption: Large language models can interpret and execute scientific simulation tasks when given appropriate context and knowledge bases.
This underpins the agent's ability to operate uniformly across packages.

invented entities (1)

Four-layer progressive disclosure mechanism (no independent evidence)
purpose: To enable uniform workflows by gradually revealing package-specific information.
New architectural component introduced to support the separation principle.

pith-pipeline@v0.9.0 · 5610 in / 1448 out tokens · 58230 ms · 2026-05-13T18:23:32.473544+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Its key design principle is the separation of package knowledge bases from simulation workflows... four-layer progressive disclosure mechanism
