arxiv: 2606.17076 · v1 · pith:DWIJ3UTRnew · submitted 2026-06-10 · ⚛️ physics.ao-ph · cs.AI

CMIP-Forge: An Agentic System that Retrieves, Computes, and Self-Reviews Climate Science

Pith reviewed 2026-06-27 07:29 UTC · model grok-4.3

classification ⚛️ physics.ao-ph cs.AI
keywords agentic systems · climate modeling · retrieval-augmented generation · CMIP6 · autonomous workflows · adversarial review · Earth system data

The pith

An agentic system can autonomously retrieve CMIP6 literature, generate code for live data analysis, and audit its own workflows through layered guardrails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CMIP-Forge as a hybrid system that pairs a large curated corpus of CMIP6 publications with an agent that plans and runs Python code against Earth system data archives. It adds multiple automated checks, including static code analysis and a separate panel of reviewer models that examine the full methodology. The goal is to overcome the manual effort required to turn thousands of papers and massive data collections into finished research tasks such as studying teleconnections or regional extremes. If the approach holds, it would let research pipelines run end to end without constant human direction while still grounding results in published science.

Core claim

CMIP-Forge demonstrates that an agentic analysis system grounded in peer-reviewed literature, constrained by automated code guardrails, and audited by an independent adversarial review loop can complete complex climate-research workflows autonomously, as shown through pipelines on atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections.

What carries the argument

The multi-layered Defense-in-Depth architecture that combines AST static analysis, audited scientific primitives, and an autonomous adversarial peer-review protocol to enforce physical and methodological invariants.
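
As a rough illustration of the first layer, the sketch below shows how AST static analysis can reject generated code before it runs. The policy sets `FORBIDDEN_CALLS` and `FORBIDDEN_MODULES` and the function `find_violations` are hypothetical; the rules CMIP-Forge actually enforces are not given in this summary.

```python
import ast

# Hypothetical policy for illustration only; not taken from the paper.
FORBIDDEN_CALLS = {"eval", "exec", "compile", "__import__"}
FORBIDDEN_MODULES = {"os", "subprocess", "socket"}


def find_violations(source: str) -> list[str]:
    """Return policy violations found in `source` without executing it."""
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b` is judged by its top-level package name.
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                    violations.append(f"line {node.lineno}: import of '{alias.name}'")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_MODULES:
                violations.append(f"line {node.lineno}: import from '{node.module}'")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only direct calls by bare name are caught in this sketch.
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"line {node.lineno}: call to '{node.func.id}'")
    return violations


if __name__ == "__main__":
    generated = "import subprocess\nresult = eval('1 + 1')\n"
    for message in find_violations(generated):
        print(message)
```

A check of this kind inspects syntax only, so it cannot catch a forbidden call reached through aliasing or attribute access; that gap is one reason a layered design would pair it with other checks.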

If this is right

  • End-to-end autonomous pipelines become feasible for tasks that currently require teams to sift through literature and data manually.
  • The same architecture can support the transition from CMIP6 to CMIP7 by turning unstructured publications into operational analysis routines.
  • Failure modes such as sycophantic regression or unresolved review verdicts become detectable through the released immutable telemetry.
  • Provenance records for every step allow later inspection of how literature, code, and review decisions combined.
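
To make the telemetry point concrete, here is a minimal sketch of how two of the named failure modes might be flagged from a log of reviewer verdicts. The `ReviewEvent` schema is hypothetical, and treating "a REVISE that flips to ACCEPT with no code change" as a sign of sycophantic regression is one possible reading, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewEvent:
    """One reviewer verdict. Hypothetical schema, not the released telemetry format."""
    round_no: int
    verdict: str    # "ACCEPT" or "REVISE"
    code_hash: str  # hash of the code the reviewer examined


def diagnose(events: list[ReviewEvent]) -> list[str]:
    """Flag two review-loop failure patterns in a list of review events."""
    findings = []
    ordered = sorted(events, key=lambda e: e.round_no)
    # Unresolved REVISE: the loop stopped while changes were still requested.
    if ordered and ordered[-1].verdict == "REVISE":
        findings.append("unresolved REVISE")
    # Suspicious flip: REVISE becomes ACCEPT although the code did not change.
    for prev, curr in zip(ordered, ordered[1:]):
        if (prev.verdict == "REVISE" and curr.verdict == "ACCEPT"
                and prev.code_hash == curr.code_hash):
            findings.append(
                f"verdict flipped without code change at round {curr.round_no}"
            )
    return findings


if __name__ == "__main__":
    log = [ReviewEvent(1, "REVISE", "a1"), ReviewEvent(2, "ACCEPT", "a1")]
    print(diagnose(log))
```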

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-review pattern could be tested on other data-rich fields that maintain large open archives and publication corpora.
  • If the guardrails scale, the time between identifying a question in the literature and obtaining a first data-driven answer could shrink substantially.
  • Extending the reviewer panel to include models fine-tuned on domain-specific error patterns might further reduce undetected mistakes.

Load-bearing premise

The layered checks and independent review loop are enough to catch and fix errors in generated workflows without any human intervention.

What would settle it

A generated workflow that produces results violating known physical constraints yet passes the full review loop and is accepted as valid.
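
A test of this kind needs an independent check on physical plausibility. The sketch below is one simple form such a check could take: range tests on a few CMIP6 variables. The numeric bounds and the helper `check_bounds` are hypothetical; the paper's own invariants are not listed in this summary.

```python
import math

# Hypothetical plausibility bounds, keyed by CMIP6 variable name.
BOUNDS = {
    "tas": (150.0, 350.0),    # near-surface air temperature, K
    "pr": (0.0, 1.0),         # precipitation flux, kg m-2 s-1 (must be non-negative)
    "siconc": (0.0, 100.0),   # sea-ice area fraction, %
}


def check_bounds(variable: str, values: list[float]) -> list[tuple[int, float, str]]:
    """Return (index, value, reason) for every value that is NaN or out of range."""
    low, high = BOUNDS[variable]
    problems = []
    for index, value in enumerate(values):
        if math.isnan(value):
            problems.append((index, value, "NaN"))
        elif not (low <= value <= high):
            problems.append((index, value, "out of range"))
    return problems


if __name__ == "__main__":
    # 15.0 looks like a Celsius value that slipped into a kelvin series.
    print(check_bounds("tas", [288.15, 15.0]))
```

A workflow whose output fails such a check yet still receives an accepting verdict from the review loop would be the counterexample described above.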

Figures

Figures reproduced from arXiv: 2606.17076 by Boris Shapkin, Dmitrii Pantiukhin, Ivan Kuznetsov, Nikolay Koldunov, Thomas Jung.

Figure 1: CMIP-Forge agentic architecture. A user prompt is consumed by a ReAct worker agent (LangGraph) whose system prompt encodes nine geophysical invariants, seven failure-mode exemplars, and the Empirical Defiance Protocol. The agent has access to fourteen tools grouped into five categories. Literature retrieval is backed by a Qdrant hybrid-search index (dense Gemini Embedding 2 plus sparse BM25 across 101,828 …
Figure 2: Upstream oceanic diagnostic: AMOC kinematic fingerprint and 15-model historical
Figure 3: Downstream atmospheric response: model-dependent European shielding effect under
Figure 4: Projected evolution of ENSO amplitude and frequency, 1950–2100, under SSP5-8.5 for
Figure 5: Projected change in oceanic frontal sharpness
Figure 6: Projected Mediterranean summer (JJA) warming, 1960–2100, for the three carry
Figure 7: Structural evolution of the North Atlantic Oscillation across three 30-year windows.
Figure 8: Three regional precipitation regimes under SSP5-8.5, 1950–2100.
Figure 9: Constrained-ensemble GMST projections, 1960–2100, under three SSP scenarios.
Original abstract

The Coupled Model Intercomparison Project Phase 6 (CMIP6) has generated thousands of peer-reviewed publications documenting model configurations, evaluation procedures, emergent constraints, and projection uncertainties. As the community transitions toward CMIP7, efficiently extracting and operationalizing this unstructured knowledge alongside live data analysis represents a critical bottleneck. Here we present CMIP-Forge, a hybrid retrieval-augmented generation (RAG) and autonomous analysis system that bridges the gap between scientific literature and Earth System Grid Federation (ESGF) data archives. The system pairs a curated corpus of 6,581 CMIP6-related open-access publications (101,828 indexed chunks) with an agentic pipeline in which a tool-augmented worker plans and executes Python workflows over live climate data, while a panel of independent reviewer models audits its methodology end to end. CMIP-Forge introduces a multi-layered Defense-in-Depth architecture that enforces physical and methodological invariants through executable mechanisms: Abstract Syntax Tree (AST) static analysis, audited scientific primitives, and an autonomous adversarial peer-review protocol. We demonstrate the system's capabilities through end-to-end autonomous research pipelines spanning atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections. An agentic analysis system grounded in peer-reviewed literature, constrained by automated code guardrails, and audited by an independent adversarial review loop can complete complex climate-research workflows autonomously. The same experiments expose concrete failure modes of the review loop (sycophantic regression, REVISE verdicts that are never resolved, and the submission of stub code for review), each diagnosable from the immutable telemetry and provenance record released with the article.
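
The Figure 1 caption describes literature retrieval over a hybrid index that combines dense embeddings with sparse BM25. How the two result lists are merged is not stated here; the sketch below assumes reciprocal rank fusion, a common choice, with the conventional constant k = 60. The chunk identifiers are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


if __name__ == "__main__":
    dense_hits = ["c1", "c2", "c3"]    # hypothetical dense-embedding results
    sparse_hits = ["c3", "c1", "c4"]   # hypothetical BM25 results
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
    # → ['c1', 'c3', 'c2', 'c4']
```

Rank-based fusion avoids having to calibrate dense similarity scores against BM25 scores, which live on different scales.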

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents CMIP-Forge, a hybrid RAG and agentic system pairing a corpus of 6,581 CMIP6 publications with tool-augmented LLM workers that plan and execute Python workflows on ESGF data. A multi-layered Defense-in-Depth architecture (AST static analysis, audited scientific primitives, autonomous adversarial peer-review protocol) is claimed to enforce physical and methodological invariants. End-to-end autonomous pipelines are demonstrated across teleconnections, ocean dynamics, extremes, and projections; the same experiments expose review-loop failure modes (sycophantic regression, unresolved REVISE verdicts, stub-code submissions) diagnosable from released telemetry.

Significance. If the architecture reliably enables fully autonomous, error-correcting workflows grounded in peer-reviewed literature, the system could materially reduce the bottleneck between CMIP6 knowledge and live data analysis. The release of immutable telemetry and provenance records is a concrete strength for reproducibility. However, the documented failure modes indicate that the autonomous review layer does not consistently surface or correct errors, limiting the immediate significance for production climate-research use.

major comments (2)
  1. [Abstract] The central claim that 'an agentic analysis system ... can complete complex climate-research workflows autonomously' lacks any quantitative success rates, error distributions, or human-baseline comparisons for the demonstrated pipelines; only qualitative demonstrations and failure modes are described.
  2. [Abstract] The explicit listing of review-loop failure modes (sycophantic regression, REVISE verdicts that are never resolved, submission of stub code) directly tests and appears to falsify the weakest assumption: that the Defense-in-Depth architecture (AST analysis + audited primitives + autonomous adversarial review) suffices to enforce invariants without human intervention.
minor comments (1)
  1. [Abstract] The abstract states that pipelines span 'atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections' but supplies no concrete results, figures, or section references for any of these demonstrations.
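
For the quantitative reporting requested in the first major comment, a success rate over a small number of pipeline runs is more informative with an interval attached. The sketch below computes a Wilson score interval; the counts in the usage line are placeholders, since the paper as summarized here reports no such numbers.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    low, high = wilson_interval(successes=7, trials=9)
    print(f"success rate 7/9, 95% interval [{low:.2f}, {high:.2f}]")
```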

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. Below we respond point by point to the major comments on the abstract. We agree that the presentation of capabilities and limitations can be clarified and will make targeted revisions to the abstract and related text.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'an agentic analysis system ... can complete complex climate-research workflows autonomously' lacks any quantitative success rates, error distributions, or human-baseline comparisons for the demonstrated pipelines; only qualitative demonstrations and failure modes are described.

    Authors: The experiments in the manuscript were designed as qualitative end-to-end demonstrations of the system on representative climate tasks together with a diagnostic analysis of review-loop behavior via released telemetry. Quantitative success rates, error distributions, and human baselines were not computed as part of this work. We will revise the abstract to remove any implication of comprehensive quantitative validation and will add a brief summary of the number of pipelines executed and the observed outcomes where these counts can be extracted directly from the published telemetry.
    revision: partial

  2. Referee: [Abstract] The explicit listing of review-loop failure modes (sycophantic regression, REVISE verdicts that are never resolved, submission of stub code) directly tests and appears to falsify the weakest assumption: that the Defense-in-Depth architecture (AST analysis + audited primitives + autonomous adversarial review) suffices to enforce invariants without human intervention.

    Authors: We do not interpret the reported failure modes as falsifying the manuscript's claims. The abstract asserts only that an agentic system 'can complete complex climate-research workflows autonomously' under the described constraints; the successful demonstrations support this existential claim. The failure modes are presented explicitly to document current limitations of the autonomous review layer and to show that the immutable telemetry makes those limitations diagnosable. The Defense-in-Depth mechanisms are not asserted to eliminate all need for human oversight in every case. We will consider a minor rephrasing of the abstract to make this scope explicit if the editor deems it necessary.
    revision: no

Circularity Check

0 steps flagged

No derivation chain present; system capability claim is not a reduction of quantities

The paper presents a system architecture and end-to-end demonstrations rather than any mathematical derivation, first-principles result, fitted parameter, or prediction that could reduce to its own inputs. No equations, ansatzes, uniqueness theorems, or self-citations of load-bearing results appear in the abstract or described content. The central statement is an empirical capability claim about an agentic workflow, and the documented failure modes are presented as observations from the same experiments rather than hidden equivalences. This matches the default case of a self-contained descriptive paper with no circularity to flag.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the untested effectiveness of the introduced guardrails and review protocol; no new physical constants or free parameters are introduced, but the architecture itself is postulated without external validation data in the provided text.

axioms (2)
  • domain assumption: Live ESGF data archives remain accessible and queryable via Python during autonomous execution.
    The agentic pipeline is described as operating over live archives; this is presupposed for any workflow to complete.
  • ad hoc to paper: LLM-generated Python code can be constrained to valid scientific primitives by AST analysis and reviewer models.
    The Defense-in-Depth architecture is presented as enforcing invariants through these mechanisms.
invented entities (1)
  • Defense-in-Depth architecture (no independent evidence)
    purpose: Enforce physical and methodological invariants via AST static analysis, audited primitives, and adversarial peer-review.
    Introduced as the core novel mechanism of CMIP-Forge.

pith-pipeline@v0.9.1-grok · 5847 in / 1531 out tokens · 19475 ms · 2026-06-27T07:29:35.363476+00:00 · methodology
