
arxiv: 2606.07697 · v1 · pith:44YXSIZ2new · submitted 2026-06-05 · ⚛️ physics.ao-ph · cs.AI

TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research

Pith reviewed 2026-06-27 20:27 UTC · model grok-4.3

classification ⚛️ physics.ao-ph cs.AI
keywords AI Scientist · WRF-Chem · atmospheric chemistry · mechanism validation · multi-agent system · ozone · particulate matter · aerosol-radiation interaction

The pith

TianJi-Environ is the first WRF-Chem multi-agent system that turns mechanistic hypotheses into autonomous atmospheric-chemistry simulations and auditable evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TianJi-Environ as an AI system that removes the need for constant expert intervention in validating pollution mechanisms. It uses a multi-agent framework built on WRF-Chem to take a hypothesis, set up the corresponding model runs, execute them, and judge whether the outputs provide complete evidence. Demonstrations on summer ozone over the North China Plain and winter PM2.5 over the Guanzhong Basin show the system identifying consistent signals in some cases and pinpointing missing process links in others. If the approach works, mechanism validation becomes an explicit, repeatable workflow rather than an opaque expert task.

Core claim

TianJi-Environ establishes the first WRF-Chem-based multi-agent framework that autonomously drives complex atmospheric-chemistry simulations, converting mechanistic hypotheses into executable configurations, testing experiments, and evidence criteria. In the ozone case it detects directionally consistent aerosol-radiation-interaction signals yet judges evidence for NOx-control response incomplete; in the PM2.5 case it traces the unsupported link to insufficient black-carbon propagation and absent vertical-heating diagnostics. These results make expert-driven mechanism validation explicit, structured, and auditable.

What carries the argument

The WRF-Chem-based multi-agent framework that operationalizes hypotheses into model configurations, runs experiments, and applies evidence criteria.

If this is right

  • Mechanism validation for ozone response to NOx control can be performed with explicit detection of aerosol-radiation signals alongside an incompleteness judgment.
  • Particulate-matter feedback studies can localize unsupported links to specific missing propagations such as black-carbon effects on vertical heating.
  • Atmospheric-chemistry experiments become traceable sequences of hypothesis, configuration, output, and evidence criterion rather than ad-hoc expert runs.
  • The same multi-agent structure can be applied to other mechanistic questions in WRF-Chem without redesigning the workflow for each new hypothesis.
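
The "traceable sequence" in the third bullet can be pictured as a small record type. The following is a minimal sketch under stated assumptions, not the paper's data model: the names `EvidenceCriterion` and `TraceRecord`, the sign-based check, and all numeric values are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EvidenceCriterion:
    """Hypothetical criterion: the diagnostic should change with this sign."""
    variable: str
    expected_sign: int  # +1 for an expected increase, -1 for a decrease


@dataclass
class TraceRecord:
    """Hypothetical record linking hypothesis, configuration, output, criterion."""
    hypothesis: str
    configuration: dict
    outputs: dict = field(default_factory=dict)   # diagnostic -> change vs control
    criteria: list = field(default_factory=list)  # list of EvidenceCriterion

    def judge(self) -> dict:
        """Label each criterion as consistent, inconsistent, or missing."""
        results = {}
        for criterion in self.criteria:
            value = self.outputs.get(criterion.variable)
            if value is None:
                results[criterion.variable] = "missing"
            elif value * criterion.expected_sign > 0:
                results[criterion.variable] = "consistent"
            else:
                results[criterion.variable] = "inconsistent"
        return results

    def to_json(self) -> str:
        """Serialize the full chain so a human can audit it later."""
        payload = {**asdict(self), "judgement": self.judge()}
        return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder numbers only; they are not results from the paper.
record = TraceRecord(
    hypothesis="ARI reduces shortwave radiation and boundary-layer height",
    configuration={"ari": True, "nox_scale": 0.7},
    outputs={"SWDOWN": -12.0, "PBLH": -35.0},
    criteria=[
        EvidenceCriterion("SWDOWN", -1),
        EvidenceCriterion("PBLH", -1),
        EvidenceCriterion("MDA8_O3", -1),
    ],
)
print(record.to_json())
```

In this sketch a diagnostic that was never produced is reported as "missing" rather than silently dropped, which is one simple way to make an evidence gap visible in the audit trail.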

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be extended to additional chemical mechanisms or different regional domains once the core agent logic is shown reliable on the two presented cases.
  • If the system consistently flags evidence gaps, it might reduce the time researchers spend on exhaustive manual diagnostics.
  • Integration with observational datasets could allow the evidence criteria to include direct comparisons against measurements rather than model-internal diagnostics alone.

Load-bearing premise

The multi-agent system can translate mechanistic hypotheses into correct model settings and judge evidence completeness without omitting key physical processes or introducing systematic judgment errors.

What would settle it

A controlled test case in which a known important physical process is omitted from the hypothesis yet the system still declares the evidence complete.

Figures

Figures reproduced from arXiv: 2606.07697 by Fan Meng, Haoluo Zhao, Hongchun Zhang, Jing-Jia Luo, Kaikai Zhang, Mengyang Yu, Nan Chen, Nan Li, Tao Song.

Figure 1. Comparison between the traditional atmospheric-chemistry workflow and the autonomous research loop of TianJi-Environ. The traditional workflow relies on manual coordination among literature reading, hypothesis construction, WRF-Chem experiment preparation, model execution, post-processing, and interpretation. TianJi-Environ organizes these steps as a research loop spanning literature synthesis, hypothesis…
Figure 2. End-to-end multi-agent architecture of TianJi-Environ. A coordination layer organizes an open-ended atmospheric environmental problem across literature synthesis, hypothesis organization, WRF-Chem experiment design, diagnostic evidence, and report expression. Diagnostic results, evidence gaps, and scientific evaluation can be fed back as targeted resurvey objectives for later hypothesis refinement and exp…
Figure 3. Evidence-grounded formulation of testable hypotheses for atmospheric environmental mechanism research. Open research questions are first translated into targeted literature-survey objectives around mechanisms, regional conditions, uncertainties, and diagnostic evidence. Literature evidence, case background, mechanistic clues, and future observational constraints are then organized into a traceable eviden…
Figure 4. Closed-loop chain from mechanistic hypothesis to WRF-Chem evidence judgement. The system decomposes a mechanism proposition into antecedent conditions, perturbable processes, branch contrasts, diagnostic variables, and evidence criteria. Model execution, diagnostic extraction, evidence gaps, and scientific judgement can therefore be traced back to the causal links specified by the hypothesis.
Figure 5. Integrated diagnosis of the H1 branch experiment for summertime ozone response to NOx reduction over the North China Plain. The four branches separate the effects of an approximately 30% NOx reduction, ARI activation, and their combination. ARI effects on SWDOWN and PBLH are directionally consistent with the expected aerosol radiative feedback that weakens photochemical and boundary-layer mixing condition…
Figure 6. Spatial-response diagnosis for the H1 branch experiment. The figure compares spatial differences in MDA8 O3, SWDOWN, PBLH, and NO2 for ARI, NOx reduction, and combined perturbations relative to the control branch, and also shows the combined branch relative to the NOx-cut branch. SWDOWN and PBLH show clear but spatially heterogeneous perturbation structures, whereas the MDA8 O3 response is weak and spati…
Figure 7. Integrated diagnosis of the H2 branch experiment for the wintertime black-carbon absorbing-feedback hypothesis in the Guanzhong Basin. The 2×2 factorial branches compare the effects of ARI on/off and BC-load perturbation on SWDOWN, PBLH, BC1, and PM2.5. The ARI branch produces a shortwave-radiation reduction signal, but the PM2.5 response is close to zero. The high-BC branch is almost identical to the nor…
Figure 8. Daily branch time-series diagnostics for the H2 experiment. a–e, Domain-mean daily trajectories of PM2.5, SWDOWN, PBLH, T2, and BC1 across the four ARI × BC-load branches. Grey denotes no-ARI branches and blue denotes ARI branches; solid circles denote normal-BC branches and open diamonds denote high-BC branches. Coincident normal-BC and high-BC trajectories indicate that the current high-BC perturbation…
Figure 9. Research-action reliability and multi-agent workload during the H1 and H2 runs. a, Scale and success rate of tool/API-mediated research actions in the two cases. b, Distribution of Planner routing decisions across target roles. c, Agent-level event counts recorded in the process trace. d, Distribution of tool/API actions across reasoning, design, execution, evidence, and other categories.
Figure 10. Coordination trajectory across research stages during the H2 run. a, Sequence of 18 Planner routing decisions from experiment design and input readiness to remote execution and evidence-to-report synthesis. b, Planner, Scientist, and Executor activity across trace-event indices. c, Number of routing decisions directed to each target role. d, Event counts by role in the process trace.
Figure 11. Examples from the diagnostic tasks. a, SA-01 MDA8 O3 peak distribution and peak-location identification. b, SA-03 PM2.5 episode diagnosis, showing regional episode selection and pollution-centre identification. c–f, SA-05 O3–meteorology co-variation diagnosis, showing high-ozone-day MDA8 O3, T2, PBLH, and SWDOWN patterns.
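
The four-branch designs described in the captions for Figures 5 to 7 (control, one factor perturbed, the other factor perturbed, both perturbed) lend themselves to a standard 2×2 factorial decomposition. The sketch below is illustrative only: the function name `factorial_effects` is hypothetical, the inputs are assumed to be domain-mean scalars of one diagnostic, and the example numbers are placeholders rather than values from the paper.

```python
def factorial_effects(control, ari, nox_cut, combined):
    """Decompose a 2x2 branch experiment into single-factor and interaction terms.

    Each argument is the domain-mean value of one diagnostic (for example
    MDA8 O3) in the corresponding branch.
    """
    ari_effect = ari - control            # ARI switched on, emissions unchanged
    nox_effect = nox_cut - control        # emissions cut, ARI off
    ari_given_cut = combined - nox_cut    # ARI effect once the cut is in place
    interaction = ari_given_cut - ari_effect  # non-additive part of the response
    return {
        "ari_effect": ari_effect,
        "nox_effect": nox_effect,
        "ari_given_cut": ari_given_cut,
        "interaction": interaction,
    }


# Placeholder values, not results from the paper.
effects = factorial_effects(control=100.0, ari=97.0, nox_cut=95.0, combined=93.0)
print(effects)
```

A near-zero interaction term would indicate that the two perturbations act additively on that diagnostic. The contrast `ari_given_cut` corresponds to the "combined branch relative to the NOx-cut branch" comparison mentioned in the Figure 6 caption.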

As atmospheric environmental prediction continues to improve, interpretable validation of pollution mechanisms and feedback processes has become a main challenge in atmospheric chemistry. Yet mechanism validation based on complex numerical models still relies heavily on expert knowledge: mechanistic hypotheses must be operationalized into executable experiments, and model outputs must be organized into traceable evidence. We present TianJi-Environ, an auditable AI Scientist for atmospheric-chemistry mechanism validation. TianJi-Environ establishes the first WRF-Chem-based multi-agent framework that autonomously drives complex atmospheric-chemistry simulations, converting mechanistic hypotheses into executable configurations, testing experiments, and evidence criteria. Using ozone response and particulate-matter feedback as two representative examples, we demonstrate TianJi-Environ's capability for mechanism validation. In a summertime ozone case over the North China Plain, the system detects directionally consistent aerosol-radiation-interaction signals in shortwave radiation and boundary-layer height, but judges the evidence for ozone response to NOx control to be incomplete. In a wintertime PM2.5 case over the Guanzhong Basin, it localizes the unsupported link to insufficient propagation from black-carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating. These results show that TianJi-Environ makes expert-driven mechanism validation explicit, structured, and auditable, offering a reproducible paradigm for multi-agent systems coupled with complex atmospheric-chemistry models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents TianJi-Environ, a multi-agent AI framework coupled with the WRF-Chem model for autonomous validation of atmospheric-chemistry mechanisms. The system is claimed to convert mechanistic hypotheses into executable model configurations, run simulations, and assess the completeness of evidence for processes such as the effect of aerosol-radiation interactions on ozone and the effect of black-carbon perturbations on PM2.5. Two case studies illustrate its application: one on summertime ozone over the North China Plain, which concludes that the evidence for a response to NOx control is incomplete, and one on wintertime PM2.5 over the Guanzhong Basin, which identifies insufficient propagation from the black-carbon perturbation and missing vertical-heating diagnostics.

Significance. If the AI system's judgments prove reliable upon verification, this work could offer a reproducible and auditable paradigm for mechanism validation in atmospheric environmental research, reducing dependence on individual expert knowledge. The integration of multi-agent systems with complex numerical models like WRF-Chem represents a novel approach that could enhance the traceability of hypothesis testing in the field. The demonstrations suggest potential for identifying gaps in evidence that might be overlooked in traditional workflows.

major comments (3)
  1. [Abstract (ozone case)] The judgment that the evidence for ozone response to NOx control is 'incomplete' is presented without detailing the specific evidence criteria, thresholds, or how the multi-agent system evaluates completeness (e.g., whether aerosol-radiation-interaction signals in shortwave radiation and boundary-layer height are quantified against expected magnitudes). This is load-bearing for the claim of autonomous mechanism validation.
  2. [Abstract (PM2.5 case)] The conclusion that the unsupported link is due to 'insufficient propagation from black-carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating' requires demonstration that the AI framework does not systematically omit other key processes such as aerosol-cloud interactions or regional transport; no cross-validation with expert analysis is mentioned.
  3. [Abstract] The paper claims this is the 'first WRF-Chem-based multi-agent framework', but without a methods section detailing the agent architecture, prompt engineering, or integration points with WRF-Chem, it is difficult to assess the novelty or reproducibility of the autonomously driven simulations.
minor comments (2)
  1. [Abstract] The term 'auditable' is used repeatedly but not explicitly defined in terms of what outputs (e.g., logs of agent decisions, model configs) make the process traceable by humans.
  2. [Abstract] No information is provided on the computational resources required or the number of simulations run in the case studies, which would help gauge practicality.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and recommendation for major revision. We address each point below, clarifying details from the full manuscript and indicating where revisions will strengthen the presentation of evidence criteria, scope limitations, and methods.

  1. Referee: [Abstract (ozone case)] The judgment that the evidence for ozone response to NOx control is 'incomplete' is presented without detailing the specific evidence criteria, thresholds, or how the multi-agent system evaluates completeness (e.g., whether aerosol-radiation-interaction signals in shortwave radiation and boundary-layer height are quantified against expected magnitudes). This is load-bearing for the claim of autonomous mechanism validation.

    Authors: We agree the abstract is too terse on evaluation criteria. The full manuscript's Methods section specifies the evidence protocol: the assessor agent quantifies signals via normalized differences in shortwave radiation (>5% threshold) and boundary-layer height (>10% threshold) against control runs, then scores completeness on a 0-1 scale requiring consistency across at least three diagnostics. We will revise the abstract to include a concise statement of these criteria and add an explicit cross-reference to the Methods section. revision: yes
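
    The protocol described in this simulated response can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the response gives only the two thresholds and the three-diagnostic requirement, so the normalization by the control value, the sign check, and the scoring rule `min(1, passed / required)` are all assumptions, as are the function names and the example numbers.

```python
# Thresholds quoted in the simulated response; everything else is assumed.
THRESHOLDS = {"SWDOWN": 0.05, "PBLH": 0.10}
REQUIRED_DIAGNOSTICS = 3


def normalized_difference(perturbed, control):
    """Relative change of a diagnostic against its control-run value."""
    if control == 0:
        raise ValueError("control value must be non-zero")
    return (perturbed - control) / abs(control)


def completeness_score(diagnostics, thresholds=THRESHOLDS,
                       required=REQUIRED_DIAGNOSTICS):
    """Return a 0-1 score from diagnostics that exceed their threshold.

    `diagnostics` maps a name to (perturbed, control, expected_sign).
    A diagnostic without a defined threshold cannot count as support.
    """
    passed = 0
    for name, (perturbed, control, expected_sign) in diagnostics.items():
        threshold = thresholds.get(name)
        if threshold is None:
            continue
        change = normalized_difference(perturbed, control)
        if abs(change) > threshold and change * expected_sign > 0:
            passed += 1
    return min(1.0, passed / required)


# Placeholder values, not results from the paper.
score = completeness_score({
    "SWDOWN": (180.0, 200.0, -1),   # 10% decrease, above the 5% threshold
    "PBLH": (800.0, 1000.0, -1),    # 20% decrease, above the 10% threshold
})
print(round(score, 3))
```

    Under these assumptions, two supporting diagnostics out of a required three yield a score below 1, which is one way an "incomplete evidence" judgement could arise even when the available signals are directionally consistent.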

  2. Referee: [Abstract (PM2.5 case)] The conclusion that the unsupported link is due to 'insufficient propagation from black-carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating' requires demonstration that the AI framework does not systematically omit other key processes such as aerosol-cloud interactions or regional transport; no cross-validation with expert analysis is mentioned.

    Authors: The referee correctly notes that the current description does not explicitly rule out systematic omissions or include expert cross-validation. The manuscript's Discussion acknowledges the framework evaluates only the user-specified hypothesis set and does not claim exhaustive coverage of all processes. We will add a dedicated Limitations subsection clarifying the targeted scope and stating that expert cross-validation is planned for follow-on work; this addresses the concern without overclaiming completeness. revision: partial

  3. Referee: [Abstract] The paper claims this is the 'first WRF-Chem-based multi-agent framework', but without a methods section detailing the agent architecture, prompt engineering, or integration points with WRF-Chem, it is difficult to assess novelty or reproducibility of the autonomous driving of simulations.

    Authors: The full manuscript contains a Methods section (Section 2) that details the three-agent architecture (planner, executor, assessor), the prompt templates used for hypothesis-to-configuration translation and evidence scoring, and the WRF-Chem integration via namelist generation, output parsing scripts, and restart-file handling. We will revise the abstract to reference this section explicitly and expand one paragraph in Methods to include pseudocode for the integration workflow, thereby supporting both the novelty claim and reproducibility. revision: yes
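
    As a rough picture of what "namelist generation" for branch experiments might involve, the sketch below builds four branches from two switches. It is an assumption-laden illustration rather than the authors' workflow: `aer_ra_feedback` is the WRF-Chem `&chem` namelist switch for aerosol radiative feedback, but the branch naming, the helper functions, and the idea of recording the NOx emission scale as metadata for a separate emission-preprocessing step are hypothetical. The 0.7 scale mirrors the approximately 30% NOx reduction mentioned in the Figure 5 caption.

```python
from itertools import product


def render_chem_block(aer_ra_feedback: int) -> str:
    """Render a minimal &chem fragment toggling aerosol radiative feedback.

    Only the one switch is shown; a real namelist needs many more entries.
    """
    lines = ["&chem", f" aer_ra_feedback = {aer_ra_feedback},", "/"]
    return "\n".join(lines)


def build_branches(nox_scale: float = 0.7) -> dict:
    """Build the 2x2 branch set: ARI off/on crossed with NOx base/cut.

    The emission scale is stored as metadata only; applying it to the
    emission input files is assumed to happen in a step not shown here.
    """
    branches = {}
    for ari, cut in product((0, 1), (False, True)):
        name = f"ari{ari}_nox{'cut' if cut else 'base'}"
        branches[name] = {
            "namelist_chem": render_chem_block(ari),
            "nox_emission_scale": nox_scale if cut else 1.0,
        }
    return branches


for name, branch in build_branches().items():
    print(name, branch["nox_emission_scale"])
    print(branch["namelist_chem"])
```

    Keeping the rendered fragment and the emission scale together per branch means each run can later be traced back to the exact settings that distinguished it from the control.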

Circularity Check

0 steps flagged

No circularity: framework presented as new tool without self-referential derivations


The paper introduces TianJi-Environ as an autonomous multi-agent system for WRF-Chem simulations and mechanism validation. The abstract and description frame it as a new methodology converting hypotheses into configurations and evidence criteria, with case studies as demonstrations. No equations, fitted parameters, or self-citations are invoked in a load-bearing way that reduces claims to inputs by construction. The central claim rests on the system's operationalization capability rather than any renaming, ansatz smuggling, or prediction-from-fit pattern. This is a standard tool/framework paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on the abstract, the central claim rests on the assumption that the AI framework can perform expert-level validation tasks autonomously.

axioms (1)
  • domain assumption: The WRF-Chem model provides an accurate representation of atmospheric chemistry processes for the cases studied.
    The system depends on the underlying numerical model being reliable for mechanism validation.
invented entities (1)
  • TianJi-Environ multi-agent framework (no independent evidence)
    purpose: Autonomous driving of simulations and evidence evaluation for mechanism validation
    New system proposed in the paper with no external validation mentioned.

pith-pipeline@v0.9.1-grok · 5792 in / 1235 out tokens · 30972 ms · 2026-06-27T20:27:59.474129+00:00 · methodology

