pith. machine review for the scientific record.

arxiv: 2604.03460 · v2 · submitted 2026-04-03 · ⚛️ physics.chem-ph · physics.comp-ph

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

Andres Felipe Bocanegra Vargas, Branislav K. Nikolić, Federico Garcia-Gaitan, Felipe Reyes-Osorio, Gang Meng, Jalil Varela-Manjarres, Mohammadhasan Dinpajooh, Tao E. Li, Xinwei Ji, Yafei Ren

Pith reviewed 2026-05-13 18:23 UTC · model grok-4.3

classification ⚛️ physics.chem-ph physics.comp-ph
keywords fermilink · agent · packages · scientific · simulations · research · reproduction · simulation

The pith

FermiLink separates package knowledge bases from simulation workflows so that one agent can run the same workflows across roughly fifty packages in nine domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FermiLink introduces a unified agent framework for autonomous scientific simulations by isolating package-specific knowledge bases from the core workflows. This separation lets the same workflows, powered by a four-layer progressive disclosure mechanism, operate uniformly across supported packages without repeated custom engineering. Benchmarks across 132 figure-reproduction tasks with 44 packages show 56.1 percent success (74 figures), with 30 of those reaching high-fidelity agreement. A single-blinded test further shows the framework can generate research-grade output on unpublished polariton problems when given only detailed objectives and source code.

Core claim

The paper claims that separating package knowledge bases from simulation workflows allows a single set of workflows to produce consistent results from figure-level tasks to full research projects on HPC clusters across roughly fifty packages spanning physics to engineering. With OpenAI Codex as the underlying model, the system reproduces 74 of 132 published figures at 56.1 percent overall success, of which 30 match at high fidelity and 35 at qualitative level, and it can deliver usable new results on undocumented problems when supplied with detailed objectives and code.
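The headline numbers in this claim can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative, and this summary does not say how the reproduced figures outside the two named tiers were graded.

```python
# Counts quoted in the core claim (variable names are illustrative).
total_tasks = 132      # figure-level reproduction tasks
reproduced = 74        # figures reported as reproduced
high_fidelity = 30     # reproduced with high-fidelity agreement
qualitative = 35       # reproduced with qualitative agreement

# Rates are taken over all 132 tasks, not over the 74 successes.
success_rate = 100 * reproduced / total_tasks
high_fidelity_rate = 100 * high_fidelity / total_tasks

# Reproduced figures that fall outside the two named tiers.
unlabelled = reproduced - (high_fidelity + qualitative)
not_reproduced = total_tasks - reproduced

print(f"overall success:        {success_rate:.1f}%")
print(f"high fidelity (of all): {high_fidelity_rate:.1f}%")
print(f"reproduced, other tier: {unlabelled}")
print(f"not reproduced:         {not_reproduced}")
```

The check confirms the quoted 56.1 percent, and it also shows that high-fidelity matches account for about 22.7 percent of all tasks, with 9 reproduced figures left outside the two named tiers.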

What carries the argument

A four-layer progressive disclosure mechanism surfaces information from isolated package knowledge bases in stages, so that the same workflows execute uniformly across packages.
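To make the separation concrete, here is a minimal sketch of how a package-agnostic workflow might consume a layered knowledge base. It is not FermiLink's implementation: this summary does not describe what the four layers contain, so the layer names, the `KnowledgeBase` class, `run_workflow`, and the `toy_attempt` stand-in for the agent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class KnowledgeBase:
    """Package-specific knowledge, kept apart from the workflow code."""
    package: str
    layers: Sequence[str]  # ordered from coarsest to most detailed


# An attempt sees the task plus the layers disclosed so far.
Attempt = Callable[[str, List[str]], Optional[str]]


def run_workflow(task: str, kb: KnowledgeBase,
                 attempt: Attempt) -> Tuple[int, Optional[str]]:
    """Disclose one more layer per round until the attempt succeeds."""
    context: List[str] = []
    depth = 0
    for depth, layer in enumerate(kb.layers, start=1):
        context.append(layer)
        result = attempt(task, context)
        if result is not None:
            return depth, result
    return depth, None  # every available layer was tried without success


def toy_attempt(task: str, context: List[str]) -> Optional[str]:
    # Stand-in for the agent: "succeeds" once three layers are visible.
    if len(context) >= 3:
        return f"{task}: done with {len(context)} layers"
    return None


# Hypothetical layer names; the source does not specify the real ones.
HYPOTHETICAL_LAYERS = ("overview", "topic index",
                       "worked example", "source excerpt")
kb_a = KnowledgeBase("package_a", HYPOTHETICAL_LAYERS)
kb_b = KnowledgeBase("package_b", HYPOTHETICAL_LAYERS[:2])

# The same workflow function serves both packages unchanged.
result_a = run_workflow("reproduce figure", kb_a, toy_attempt)
result_b = run_workflow("reproduce figure", kb_b, toy_attempt)
print(result_a)
print(result_b)
```

The design point the sketch tries to capture is that `run_workflow` never branches on the package name: only the data in the knowledge base differs, which is the property the paper's uniformity claim depends on.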

If this is right

  • Workflows can scale directly from single-figure reproduction to full-paper research on HPC clusters without redesign.
  • The same agent can handle tasks in nine domains with minimal additional setup per package.
  • Reproduction benchmarks indicate that 56 percent of the 132 benchmarked figures can be matched automatically, with a subset reaching high fidelity.
  • Blind operation on unpublished problems is possible when objectives and source code are supplied.
  • The approach supplies a reusable infrastructure for moving from scientific questions to computational results across domains.
  • Inference: The separation principle could apply to agent systems outside scientific simulation if similar knowledge isolation proves effective.
  • Inference: Future extensions might allow rapid addition of new packages with only code access rather than documentation or tutorials.
  • Inference: Success on undocumented polariton cases suggests that source-code inspection alone can substitute for external guidance i…

Load-bearing premise

The separation of package knowledge bases from simulation workflows enables uniform and effective operation across diverse packages via the four-layer progressive disclosure mechanism without requiring substantial per-package custom engineering or frequent expert intervention.

What would settle it

A controlled test adding a new scientific package where the framework requires heavy custom code or repeated expert fixes to reach comparable reproduction rates would show the separation does not deliver the claimed uniformity.
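Deciding whether a new package reaches "comparable reproduction rates" needs some allowance for sampling noise. The sketch below is one possible way to frame that comparison, not an analysis from the paper: it computes a 95 percent Wilson score interval for the reported 74 of 132 baseline and checks whether an entirely hypothetical result for a new package (3 of 12 tasks) is statistically distinguishable from it.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Baseline from the reported benchmark: 74 of 132 figures reproduced.
baseline_low, baseline_high = wilson_interval(74, 132)

# Hypothetical outcome for a newly added package (made-up numbers).
new_low, new_high = wilson_interval(3, 12)

# Crude comparability check: do the two intervals overlap?
intervals_overlap = new_low <= baseline_high and baseline_low <= new_high

print(f"baseline: {baseline_low:.3f} to {baseline_high:.3f}")
print(f"new pkg:  {new_low:.3f} to {new_high:.3f}")
print(f"overlap:  {intervals_overlap}")
```

The baseline interval spans roughly 48 to 64 percent. With only a dozen tasks, even a 25 percent success rate produces an interval that overlaps it, so a decisive test of the uniformity claim would need a reasonably large task set for the new package. Interval overlap is a conservative heuristic; a formal two-proportion test would be the stricter choice.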

Figures

Figures reproduced from arXiv: 2604.03460 by Andres Felipe Bocanegra Vargas, Branislav K. Nikolić, Federico Garcia-Gaitan, Felipe Reyes-Osorio, Gang Meng, Jalil Varela-Manjarres, Mohammadhasan Dinpajooh, Tao E. Li, Xinwei Ji, Yafei Ren.

Figure 1. Design of the […]
Figure 2. Summary of the 132 figure-level reproduction tasks in SI Table S1. (a) Outcome distribution across nine scientific […]
Figure 3. Comparison of calculated UP decay rates versus Rabi splitting for single-blinded simulations with the […]
Original abstract

Artificial-intelligence (AI) agent frameworks have been developed for autonomous scientific simulations, but most current agent frameworks are tailored to a single or a small set of software packages. Herein, FermiLink, a unified and extensible open-source agent framework is introduced for multidomain scientific simulations. Its key design principle is the separation of package knowledge bases from simulation workflows, so that simulation workflows in FermiLink, from figure-level simulations to full-paper-level research on high-performance computing clusters, operate uniformly among supported packages via a four-layer progressive disclosure mechanism. Using OpenAI Codex as the agent provider, the capabilities of FermiLink are demonstrated across approximately 50 scientific software packages spanning nine research domains from physics to engineering. Systematic benchmarks on 132 real-world figure-level reproduction tasks with 44 packages show that FermiLink reproduces 74 (56.1%) of published figures with simulations, among which 30 achieve high-fidelity agreement and 35 reach qualitative agreement with the target figures. A smaller set of human expert-guided reproduction benchmarks with 10 packages further highlights the importance of expert insights for improving the simulation fidelity. Beyond reproduction, a single-blinded study demonstrates that FermiLink can produce research-grade results on unpublished polariton physics problems when provided with sufficiently detailed research objectives and source code, even in the absence of external documentation or tutorials. Overall, FermiLink provides a scalable research infrastructure that may accelerate the path from scientific questions to computational results across diverse domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FermiLink, a unified open-source agent framework for multidomain scientific simulations. Its core design separates package knowledge bases from simulation workflows, enabling uniform operation across ~50 packages in nine domains via a four-layer progressive disclosure mechanism. Using OpenAI Codex, it reports systematic benchmarks reproducing 74 of 132 real-world figures (56.1%, with 30 high-fidelity and 35 qualitative), notes improved fidelity with human guidance on 10 packages, and presents a single-blinded study showing research-grade results on unpublished polariton physics problems given detailed objectives and source code without external documentation.

Significance. If the empirical results hold, FermiLink could provide a scalable infrastructure that accelerates the transition from scientific questions to computational outcomes across physics, chemistry, engineering, and related fields by reducing package-specific engineering. The concrete benchmark numbers and blinded study on novel problems supply falsifiable evidence of multidomain capability, though the moderate success rate and documented need for expert input limit the strength of claims for full autonomy.

major comments (2)
  1. [Benchmark Results] Benchmark section (132 figure-level tasks): the 56.1% reproduction rate (74/132, only 30 high-fidelity) together with the explicit statement that human expert guidance improves fidelity already demonstrates that source-code-only operation is error-prone even for known tasks; this directly undermines the claim that the four-layer mechanism delivers uniform, low-intervention performance once knowledge bases are separated.
  2. [Single-blinded Study] Single-blinded polariton study: success is reported only when 'sufficiently detailed research objectives and source code' are supplied; because accurate knowledge bases covering APIs, parameter ranges, and usage patterns must still be constructed per package, the separation principle does not eliminate the need for substantial domain-expert curation, weakening generalization to truly novel problems.
minor comments (2)
  1. [Abstract] Abstract states 'approximately 50' packages while benchmarks use 44; provide an exact count and clarify which packages were used for the blinded study versus the 132-task benchmark.
  2. [Methods] The four-layer progressive disclosure mechanism is introduced as a key innovation but lacks a concise schematic or pseudocode; a small diagram or enumerated steps would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help us clarify the scope and limitations of FermiLink. We provide point-by-point responses below and have revised the manuscript accordingly to strengthen the presentation of our results.

  1. Referee: [Benchmark Results] Benchmark section (132 figure-level tasks): the 56.1% reproduction rate (74/132, only 30 high-fidelity) together with the explicit statement that human expert guidance improves fidelity already demonstrates that source-code-only operation is error-prone even for known tasks; this directly undermines the claim that the four-layer mechanism delivers uniform, low-intervention performance once knowledge bases are separated.

    Authors: The 56.1% reproduction rate is presented transparently in the manuscript as the current performance level using OpenAI Codex. The four-layer mechanism enables this level of performance uniformly across 44 packages by progressively disclosing information and handling errors in a standardized way, which would not be feasible without the separation of knowledge bases from workflows. We do not claim error-free or fully autonomous operation; rather, the framework reduces the engineering overhead for multidomain use. The human guidance results are included to show potential for improvement, not to indicate failure of the base system. To address the concern, we will add a dedicated limitations subsection in the revised manuscript discussing the current success rates and the role of expert input. revision: partial

  2. Referee: [Single-blinded Study] Single-blinded polariton study: success is reported only when 'sufficiently detailed research objectives and source code' are supplied; because accurate knowledge bases covering APIs, parameter ranges, and usage patterns must still be constructed per package, the separation principle does not eliminate the need for substantial domain-expert curation, weakening generalization to truly novel problems.

    Authors: In the single-blinded study, the provision of detailed objectives and source code allows the agent to operate on unpublished problems without relying on external documentation, which is the key test of the framework's generalization capability. The construction of knowledge bases is indeed necessary but is decoupled from the simulation workflow, allowing the same workflow code to be applied across domains once the base is built. This separation is what enables scaling to ~50 packages. We acknowledge that expert curation is required for new packages and do not claim zero-effort deployment for arbitrary new software. We will revise the text to explicitly state that knowledge base construction remains a domain-expert task, while emphasizing the uniformity of the operational layer. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarks and blinded study are independent of any derivation chain

The manuscript describes a software framework whose central claims rest on reported reproduction rates (74/132 tasks) and a single-blinded polariton study. No equations, uniqueness theorems, fitted parameters, or first-principles derivations appear in the provided text. The four-layer mechanism is presented as an engineering design choice, not a result derived from prior outputs of the same system. Self-citations are absent from the load-bearing sections. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new software architecture without new physical parameters, fitted constants, or entities possessing independent falsifiable evidence outside the framework itself.

axioms (1)
  • Domain assumption: Large language models can interpret and execute scientific simulation tasks when given appropriate context and knowledge bases.
    This underpins the agent's ability to operate uniformly across packages.
invented entities (1)
  • Four-layer progressive disclosure mechanism (no independent evidence)
    Purpose: To enable uniform workflows by gradually revealing package-specific information.
    New architectural component introduced to support the separation principle.

pith-pipeline@v0.9.0 · 5610 in / 1448 out tokens · 58230 ms · 2026-05-13T18:23:32.473544+00:00 · methodology


