CMIP-Forge: An Agentic System that Retrieves, Computes, and Self-Reviews Climate Science

Boris Shapkin; Dmitrii Pantiukhin; Ivan Kuznetsov; Nikolay Koldunov; Thomas Jung

arxiv: 2606.17076 · v1 · pith:DWIJ3UTRnew · submitted 2026-06-10 · ⚛️ physics.ao-ph · cs.AI

Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov, Thomas Jung, Nikolay Koldunov

Pith reviewed 2026-06-27 07:29 UTC · model grok-4.3

keywords: agentic systems · climate modeling · retrieval-augmented generation · CMIP6 · autonomous workflows · adversarial review · Earth system data

The pith

An agentic system can autonomously retrieve CMIP6 literature, generate code for live data analysis, and audit its own workflows through layered guardrails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CMIP-Forge as a hybrid system that pairs a large curated corpus of CMIP6 publications with an agent that plans and runs Python code against Earth system data archives. It adds multiple automated checks, including static code analysis and a separate panel of reviewer models that examine the full methodology. The goal is to overcome the manual effort required to turn thousands of papers and massive data collections into finished research tasks such as studying teleconnections or regional extremes. If the approach holds, it would let research pipelines run end to end without constant human direction while still grounding results in published science.

Core claim

CMIP-Forge demonstrates that an agentic analysis system grounded in peer-reviewed literature, constrained by automated code guardrails, and audited by an independent adversarial review loop can complete complex climate-research workflows autonomously, as shown through pipelines on atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections.

What carries the argument

The multi-layered Defense-in-Depth architecture that combines AST static analysis, audited scientific primitives, and an autonomous adversarial peer-review protocol to enforce physical and methodological invariants.
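
To make the AST static-analysis layer concrete, here is a minimal sketch of how generated code could be screened before execution. The abstract does not list the rules CMIP-Forge enforces, so the deny-lists `FORBIDDEN_IMPORTS` and `FORBIDDEN_CALLS` and the function `find_violations` are hypothetical stand-ins, not the authors' implementation.

```python
import ast

# Hypothetical policy for illustration only; the paper's actual rule set
# is not given in this excerpt.
FORBIDDEN_IMPORTS = {"os", "subprocess", "socket"}
FORBIDDEN_CALLS = {"eval", "exec", "open"}


def find_violations(source: str) -> list[str]:
    """Inspect `source` statically (without running it) and list policy violations."""
    found = []  # (line number, message) pairs
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    found.append((node.lineno, f"import of '{alias.name}'"))
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_IMPORTS:
                found.append((node.lineno, f"import from '{node.module}'"))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                found.append((node.lineno, f"call to '{node.func.id}()'"))
    # ast.walk is breadth-first, so sort to report in source order.
    return [f"line {lineno}: {message}" for lineno, message in sorted(found)]


if __name__ == "__main__":
    generated = "import subprocess\nx = eval('1+1')\nimport numpy as np\n"
    for violation in find_violations(generated):
        print(violation)
```

A check of this kind only sees syntax: it can reject a banned import or call, but it cannot tell whether an allowed computation is scientifically sound, which is presumably why the architecture pairs it with audited primitives and a review panel.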

If this is right

End-to-end autonomous pipelines become feasible for tasks that currently require teams to sift through literature and data manually.
The same architecture can support the transition from CMIP6 to CMIP7 by turning unstructured publications into operational analysis routines.
Failure modes such as sycophantic regression or unresolved review verdicts become detectable through the released immutable telemetry.
Provenance records for every step allow later inspection of how literature, code, and review decisions combined.
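
The point about detecting failure modes from telemetry can be sketched as a small log scan. The record layout below (`ReviewRound` with a round index, a verdict, and a code fingerprint) is a hypothetical schema, not the format of the released telemetry. `REVISE` is the verdict named in the abstract; `ACCEPT` is an assumed counterpart. Treating "approval with unchanged code" as a sign of sycophantic regression is also an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRound:
    round_index: int
    verdict: str    # "REVISE" comes from the abstract; "ACCEPT" is assumed
    code_hash: str  # hypothetical fingerprint of the code under review


def diagnose(rounds: list[ReviewRound]) -> list[str]:
    """Flag review histories that match two of the reported failure patterns."""
    flags = []
    ordered = sorted(rounds, key=lambda r: r.round_index)
    # A history that ends on REVISE was never resolved.
    if ordered and ordered[-1].verdict == "REVISE":
        flags.append("unresolved_revise")
    # A REVISE followed by ACCEPT on byte-identical code suggests the reviewer
    # gave way without any fix (assumed proxy for sycophantic regression).
    for prev, curr in zip(ordered, ordered[1:]):
        if (prev.verdict == "REVISE" and curr.verdict == "ACCEPT"
                and prev.code_hash == curr.code_hash):
            flags.append(f"approval_without_change@round{curr.round_index}")
    return flags


if __name__ == "__main__":
    history = [ReviewRound(1, "REVISE", "a1"), ReviewRound(2, "ACCEPT", "a1")]
    print(diagnose(history))
```

Checks like this depend on the log being immutable and complete, which is the property the paper claims for its released record.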

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-review pattern could be tested on other data-rich fields that maintain large open archives and publication corpora.
If the guardrails scale, the time between identifying a question in the literature and obtaining a first data-driven answer could shrink substantially.
Extending the reviewer panel to include models fine-tuned on domain-specific error patterns might further reduce undetected mistakes.

Load-bearing premise

The layered checks and independent review loop are enough to catch and fix errors in generated workflows without any human intervention.

What would settle it

A generated workflow that produces results violating known physical constraints yet passes the full review loop and is accepted as valid.
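
As a rough sketch of how such a counterexample could be recognized automatically, the snippet below checks a few textbook physical bounds and reports a case where they fail but the review verdict was still positive. The field keys (`tas_kelvin`, `pr`, `siconc_fraction`), the three bounds, and the `ACCEPT` verdict string are assumptions chosen for illustration. The nine geophysical invariants mentioned in the Figure 1 caption are not listed in this excerpt and may differ.

```python
import math

# Hypothetical bounds for illustration; not the paper's invariant list.
CHECKS = {
    "tas_kelvin": lambda v: v > 0.0,              # absolute temperature is positive
    "pr": lambda v: v >= 0.0,                     # precipitation flux is non-negative
    "siconc_fraction": lambda v: 0.0 <= v <= 1.0, # ice concentration as a fraction
}


def violated_constraints(fields: dict[str, list[float]]) -> list[str]:
    """Return the names of fields containing at least one out-of-bounds value."""
    failed = []
    for name, is_valid in CHECKS.items():
        values = fields.get(name, [])
        # NaN is treated as a masked cell (e.g. land in an ocean field) and skipped.
        if any(not math.isnan(v) and not is_valid(v) for v in values):
            failed.append(name)
    return failed


def is_counterexample(fields: dict[str, list[float]], review_verdict: str) -> bool:
    """True when results break a bound yet the review loop accepted them."""
    return review_verdict == "ACCEPT" and bool(violated_constraints(fields))


if __name__ == "__main__":
    results = {"tas_kelvin": [288.1, 289.4], "pr": [-1.0e-6]}
    print(violated_constraints(results), is_counterexample(results, "ACCEPT"))
```

Simple range checks catch only gross violations. A workflow could stay within every bound and still be methodologically wrong, so a clean result here would not by itself confirm the premise.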

Figures

Figures reproduced from arXiv: 2606.17076 by Boris Shapkin, Dmitrii Pantiukhin, Ivan Kuznetsov, Nikolay Koldunov, Thomas Jung.

**Figure 1.** CMIP-Forge agentic architecture. A user prompt is consumed by a ReAct worker agent (LangGraph) whose system prompt encodes nine geophysical invariants, seven failure-mode exemplars, and the Empirical Defiance Protocol. The agent has access to fourteen tools grouped into five categories. Literature retrieval is backed by a Qdrant hybrid-search index (dense Gemini Embedding 2 plus sparse BM25 across 101,828 …

**Figure 2.** Upstream oceanic diagnostic: AMOC kinematic fingerprint and 15-model historical …

**Figure 3.** Downstream atmospheric response: model-dependent European shielding effect under …

**Figure 4.** Projected evolution of ENSO amplitude and frequency, 1950–2100, under SSP5-8.5 for …

**Figure 5.** Projected change in oceanic frontal sharpness …

**Figure 6.** Projected Mediterranean summer (JJA) warming, 1960–2100, for the three carry …

**Figure 7.** Structural evolution of the North Atlantic Oscillation across three 30-year windows.

**Figure 8.** Three regional precipitation regimes under SSP5-8.5, 1950–2100.

**Figure 9.** Constrained-ensemble GMST projections, 1960–2100, under three SSP scenarios.
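
The Figure 1 caption describes a hybrid index that combines dense embeddings with sparse BM25 scores. One common way to merge two ranked lists is reciprocal rank fusion (RRF). The caption is truncated before it names a fusion rule, so treat the choice of RRF, the constant `k=60`, and the chunk identifiers below as assumptions rather than a description of CMIP-Forge or of Qdrant's API.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense_hits = ["chunk_a", "chunk_b", "chunk_c"]   # hypothetical dense ranking
    sparse_hits = ["chunk_b", "chunk_d", "chunk_a"]  # hypothetical BM25 ranking
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
    # ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.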

Original abstract

The Coupled Model Intercomparison Project Phase 6 (CMIP6) has generated thousands of peer-reviewed publications documenting model configurations, evaluation procedures, emergent constraints, and projection uncertainties. As the community transitions toward CMIP7, efficiently extracting and operationalizing this unstructured knowledge alongside live data analysis represents a critical bottleneck. Here we present CMIP-Forge, a hybrid retrieval-augmented generation (RAG) and autonomous analysis system that bridges the gap between scientific literature and Earth System Grid Federation (ESGF) data archives. The system pairs a curated corpus of 6,581 CMIP6-related open-access publications (101,828 indexed chunks) with an agentic pipeline in which a tool-augmented worker plans and executes Python workflows over live climate data, while a panel of independent reviewer models audits its methodology end to end. CMIP-Forge introduces a multi-layered Defense-in-Depth architecture that enforces physical and methodological invariants through executable mechanisms: Abstract Syntax Tree (AST) static analysis, audited scientific primitives, and an autonomous adversarial peer-review protocol. We demonstrate the system's capabilities through end-to-end autonomous research pipelines spanning atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections. An agentic analysis system grounded in peer-reviewed literature, constrained by automated code guardrails, and audited by an independent adversarial review loop can complete complex climate-research workflows autonomously. The same experiments expose concrete failure modes of the review loop (sycophantic regression, REVISE verdicts that are never resolved, and the submission of stub code for review), each diagnosable from the immutable telemetry and provenance record released with the article.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CMIP-Forge puts together RAG, code agents, and an adversarial reviewer panel for CMIP6 workflows and honestly logs where the review layer breaks, but the absence of success metrics leaves the autonomy claim unproven.

The main point is that this paper describes a named system that links a large CMIP6 literature corpus to live ESGF data through an agent that writes and runs Python code, with AST checks and a separate reviewer panel meant to catch errors. They show example pipelines on teleconnections, ocean dynamics, and extremes, and they release the telemetry so the failure cases are visible.

What stands out as new is the concrete combination for this domain: the curated 6,581-paper index, the tool-augmented worker, and the multi-layer guardrails including audited primitives and the adversarial loop. Releasing the logs that document sycophantic regression, unresolved REVISE verdicts, and stub code is also useful; it turns the experiments into a record of where current LLM setups still need human oversight.

The paper does a reasonable job of being transparent about those limits instead of claiming the system is already reliable. The soft spot is the missing numbers. The abstract talks about end-to-end demonstrations and diagnosable failure modes but gives no success rates, error distributions, or human baseline comparisons. Without those, it is hard to judge whether the Defense-in-Depth layers actually reduce mistakes enough to matter in practice or whether the reported failures are the typical case.

This is the sort of work that would interest people building AI tools for Earth-system data analysis. A reader who wants to see how agentic pipelines behave on real scientific tasks could get value from the architecture and the failure telemetry. It deserves a serious referee because the implementation is grounded in an actual bottleneck and the honesty about limitations gives referees something concrete to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper presents CMIP-Forge, a hybrid RAG and agentic system pairing a corpus of 6,581 CMIP6 publications with tool-augmented LLM workers that plan and execute Python workflows on ESGF data. A multi-layered Defense-in-Depth architecture (AST static analysis, audited scientific primitives, autonomous adversarial peer-review protocol) is claimed to enforce physical and methodological invariants. End-to-end autonomous pipelines are demonstrated across teleconnections, ocean dynamics, extremes, and projections; the same experiments expose review-loop failure modes (sycophantic regression, unresolved REVISE verdicts, stub-code submissions) diagnosable from released telemetry.

Significance. If the architecture reliably enables fully autonomous, error-correcting workflows grounded in peer-reviewed literature, the system could materially reduce the bottleneck between CMIP6 knowledge and live data analysis. The release of immutable telemetry and provenance records is a concrete strength for reproducibility. However, the documented failure modes indicate that the autonomous review layer does not consistently surface or correct errors, limiting the immediate significance for production climate-research use.

major comments (2)

[Abstract] The central claim that 'an agentic analysis system ... can complete complex climate-research workflows autonomously' lacks any quantitative success rates, error distributions, or human-baseline comparisons for the demonstrated pipelines; only qualitative demonstrations and failure modes are described.

[Abstract] The explicit listing of review-loop failure modes (sycophantic regression, REVISE verdicts that are never resolved, submission of stub code) directly tests and appears to falsify the weakest assumption that the Defense-in-Depth architecture (AST analysis + audited primitives + autonomous adversarial review) suffices to enforce invariants without human intervention.

minor comments (1)

[Abstract] The abstract states that pipelines span 'atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections' but supplies no concrete results, figures, or section references for any of these demonstrations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. Below we respond point by point to the major comments on the abstract. We agree that the presentation of capabilities and limitations can be clarified and will make targeted revisions to the abstract and related text.

Referee: [Abstract] The central claim that 'an agentic analysis system ... can complete complex climate-research workflows autonomously' lacks any quantitative success rates, error distributions, or human-baseline comparisons for the demonstrated pipelines; only qualitative demonstrations and failure modes are described.

Authors: The experiments in the manuscript were designed as qualitative end-to-end demonstrations of the system on representative climate tasks together with a diagnostic analysis of review-loop behavior via released telemetry. Quantitative success rates, error distributions, and human baselines were not computed as part of this work. We will revise the abstract to remove any implication of comprehensive quantitative validation and will add a brief summary of the number of pipelines executed and the observed outcomes where these counts can be extracted directly from the published telemetry.

revision: partial

Referee: [Abstract] The explicit listing of review-loop failure modes (sycophantic regression, REVISE verdicts that are never resolved, submission of stub code) directly tests and appears to falsify the weakest assumption that the Defense-in-Depth architecture (AST analysis + audited primitives + autonomous adversarial review) suffices to enforce invariants without human intervention.

Authors: We do not interpret the reported failure modes as falsifying the manuscript's claims. The abstract asserts only that an agentic system 'can complete complex climate-research workflows autonomously' under the described constraints; the successful demonstrations support this existential claim. The failure modes are presented explicitly to document current limitations of the autonomous review layer and to show that the immutable telemetry makes those limitations diagnosable. The Defense-in-Depth mechanisms are not asserted to eliminate all need for human oversight in every case. We will consider a minor rephrasing of the abstract to make this scope explicit if the editor deems it necessary.

revision: no

Circularity Check

0 steps flagged

No derivation chain present; system capability claim is not a reduction of quantities

The paper presents a system architecture and end-to-end demonstrations rather than any mathematical derivation, first-principles result, fitted parameter, or prediction that could reduce to its own inputs. No equations, ansatzes, uniqueness theorems, or self-citations of load-bearing results appear in the abstract or described content. The central statement is an empirical capability claim about an agentic workflow, and the documented failure modes are presented as observations from the same experiments rather than hidden equivalences. This matches the default case of a self-contained descriptive paper with no circularity to flag.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the untested effectiveness of the introduced guardrails and review protocol; no new physical constants or free parameters are introduced, but the architecture itself is postulated without external validation data in the provided text.

axioms (2)

domain assumption: Live ESGF data archives remain accessible and queryable via Python during autonomous execution.
The agentic pipeline is described as operating over live archives; this is presupposed for any workflow to complete.

ad hoc to paper: LLM-generated Python code can be constrained to valid scientific primitives by AST analysis and reviewer models.
The Defense-in-Depth architecture is presented as enforcing invariants through these mechanisms.

invented entities (1)

Defense-in-Depth architecture (no independent evidence)
purpose: Enforce physical and methodological invariants via AST static analysis, audited primitives, and adversarial peer-review.
Introduced as the core novel mechanism of CMIP-Forge.

pith-pipeline@v0.9.1-grok · 5847 in / 1531 out tokens · 19475 ms · 2026-06-27T07:29:35.363476+00:00 · methodology
