pith. machine review for the scientific record.

arxiv: 2604.16922 · v2 · submitted 2026-04-18 · 💻 cs.AI

Recognition: unknown

ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · climate science · autonomous analysis · tool use · ClimaBench · end-to-end modeling · reasoning protocols

The pith

ClimAgent turns LLMs into agents that run full climate science analyses by linking tools with structured reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents ClimAgent, a framework that turns large language models into autonomous agents capable of tackling open-ended climate science problems. It combines a shared toolkit with strict reasoning steps to move past simple question-answering and instead perform complete data analysis and modeling while respecting physical constraints. A sympathetic reader would care because the explosion of climate datasets and tools has outpaced human researchers, creating bottlenecks that such an agent system could help clear. The work introduces ClimaBench as a test suite drawn from real professional tasks between 2000 and 2025, and shows the new method delivers more rigorous and practical outputs than prior approaches.

Core claim

ClimAgent is introduced as a general-purpose autonomous framework for executing a wide spectrum of research tasks across diverse climate sub-fields. By integrating a unified tool-use environment with rigorous reasoning protocols, it enables end-to-end modeling and analysis instead of being limited to retrieval. Its effectiveness is shown through experiments on ClimaBench, the first comprehensive benchmark for real-world climate discovery spanning five task categories from professional scenarios.
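
The abstract gives only the shape of the method: a unified tool-use environment driven by structured reasoning steps. As a rough, non-authoritative sketch of that pattern, with every name below (ToolRegistry, run_task, the plan/act/verify step kinds) invented for illustration rather than taken from the paper, the loop such a claim describes might look like:

    # Hypothetical plan/act/verify agent loop; names and step schema are
    # illustrative assumptions, not ClimAgent's actual implementation.
    from dataclasses import dataclass, field
    from typing import Any, Callable

    @dataclass
    class ToolRegistry:
        """Unified tool-use environment: tool name -> callable."""
        tools: dict = field(default_factory=dict)

        def register(self, name: str, fn: Callable) -> None:
            self.tools[name] = fn

        def call(self, name: str, **kwargs: Any) -> Any:
            return self.tools[name](**kwargs)

    def run_task(llm: Callable, registry: ToolRegistry, task: str,
                 max_steps: int = 10) -> Any:
        """Structured reasoning protocol: the model alternates planning,
        tool calls, and verification until it declares the task done."""
        history = [f"TASK: {task}"]
        for _ in range(max_steps):
            # llm is assumed to return a dict describing the next step.
            step = llm("\n".join(history))
            if step["kind"] == "done":
                return step["answer"]
            if step["kind"] == "act":
                obs = registry.call(step["tool"], **step["args"])
                history.append(f"OBSERVATION: {obs}")
            else:  # "plan" or "verify": keep the reasoning in context
                history.append(f"{step['kind'].upper()}: {step['text']}")
        return None  # step budget exhausted without a final answer

The load-bearing part for climate work is what goes into the registry (reanalysis queries, regridding, numerical solvers) and how strict the verify steps are; the loop itself is generic.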

What carries the argument

A unified tool-use environment combined with rigorous reasoning protocols, which together allow the agent to handle intricate physical constraints and data-driven requirements in climate analysis.

Load-bearing premise

That combining a unified set of tools with structured reasoning protocols is sufficient to manage the detailed physical constraints and data complexities of professional climate science without oversimplifying important aspects.

What would settle it

A demonstration that ClimAgent's outputs violate established physical principles in climate models, or produce impractical results when applied to a new dataset outside the 2000–2025 benchmark period, would undercut the core claim.
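
That test is cheap to automate. A minimal sketch, assuming invented field names and deliberately crude plausibility bounds (none of this is from the paper), of how one might screen agent outputs on an out-of-period dataset:

    # Illustrative physical-sanity screen; field names and thresholds are
    # assumptions for the sketch, not ClimAgent's or ClimaBench's checks.
    import numpy as np

    def constraint_violations(output: dict) -> list:
        """Return the list of violated constraints (empty = passes)."""
        violations = []
        precip = np.asarray(output["precipitation_mm"])
        temp_k = np.asarray(output["temperature_K"])
        if (precip < 0).any():
            violations.append("negative precipitation")
        if ((temp_k < 150) | (temp_k > 350)).any():
            violations.append("surface temperature outside plausible range")
        return violations

Running such checks over ClimAgent outputs on post-2025 data would directly probe the load-bearing premise: frequent violations would undercut it, a clean record would strengthen it.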

Figures

Figures reproduced from arXiv: 2604.16922 by Hao Liu, Hao Wang, Jindong Han, Wei Fan.

Figure 1: Previous knowledge-extraction problem vs. data-driven open-ended climate analysis problem.
Figure 2: The overall framework of the proposed ClimAgent.
Figure 3: Left: distribution of task categories and domains in ClimaBench. Right: example of one task–solution pair in the dataset.
Dataset description: ClimaBench contains 320 tasks. Its five categories capture the major reasoning needs of data-driven climate science: (1) data query tasks (e.g., dataset collection), (2) concept anal…
Figure 4: Ablation study and cost-efficiency analysis.
Figure 5: The prompt used for Analysis Evaluation.
Figure 6: The prompt used for Solution Correction.
Figure 7: The prompt used for Solution Correction.
Figure 8: The prompt used for Practicality and Scientificity.
Figure 9: The prompt used for Result and Bias Analysis.
Figure 11: Case study for typhoon analysis.
Figure 12: Case study for typhoon analysis.
read the original abstract

Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor-intensive workflows. While the emergence Large Language Models (LLMs) offers a transformative paradigm to scale scientific expertise, existing explorations remain largely confined to simple Question-Answering (Q&A) tasks. These approaches often oversimplify real-world challenges, neglecting the intricate physical constraints and the data-driven nature required in professional climate science. To bridge this gap, we introduce ClimAgent, a general-purpose autonomous framework designed to execute a wide spectrum of research tasks across diverse climate sub-fields. By integrating a unified tool-use environment with rigorous reasoning protocols, ClimAgent transcends simple retrieval to perform end-to-end modeling and analysis. To foster systematic evaluation, we propose ClimaBench, the first comprehensive benchmark for real-world climate discovery. It encompasses challenging problems spanning 5 distinct task categories derived from professional scenarios between 2000 and 2025. Experiments on ClimaBench demonstrate that ClimAgent significantly outperforms state-of-the-art baselines, achieving a 40.21% improvement over original LLM solutions in solution rigorousness and practicality. Our code are available at https://github.com/usail-hkust/ClimAgent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ClimAgent, an LLM-based autonomous agent framework for open-ended climate science analysis. It integrates a unified tool-use environment with rigorous reasoning protocols to enable end-to-end modeling and analysis across climate sub-fields, moving beyond simple Q&A tasks. The authors propose ClimaBench, the first comprehensive benchmark spanning 5 task categories derived from professional climate scenarios (2000-2025). Experiments on ClimaBench are reported to show that ClimAgent significantly outperforms state-of-the-art baselines, with a 40.21% improvement in solution rigorousness and practicality. Code is made available at a GitHub repository.

Significance. If the reported performance gains hold under rigorous scrutiny, the work would be significant for advancing LLM agents in complex, data-driven scientific domains. It addresses a clear gap by targeting physical constraints and professional workflows in climate research rather than simplified tasks, and the introduction of ClimaBench could serve as a useful evaluation resource. Open-sourcing the code supports reproducibility, which is a positive attribute for this type of systems paper.

major comments (1)
  1. Abstract: The central empirical claim of a 40.21% improvement in rigorousness and practicality is presented without any details on experimental setup, including the specific baselines used, definitions or rubrics for the metrics 'rigorousness' and 'practicality', task-level results, statistical tests, or variance across runs. This absence prevents assessment of whether the improvement is load-bearing or reproducible.
minor comments (2)
  1. Abstract: Typo/grammar: 'the emergence Large Language Models' should read 'the emergence of Large Language Models'.
  2. Abstract: Grammar: 'Our code are available' should be 'Our code is available'.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript to improve transparency of the reported results.

read point-by-point responses
  1. Referee: Abstract: The central empirical claim of a 40.21% improvement in rigorousness and practicality is presented without any details on experimental setup, including the specific baselines used, definitions or rubrics for the metrics 'rigorousness' and 'practicality', task-level results, statistical tests, or variance across runs. This absence prevents assessment of whether the improvement is load-bearing or reproducible.

    Authors: We agree that the abstract, constrained by length, omits key experimental details that would aid assessment. In the revised version we will expand the abstract to (1) name the primary baselines (vanilla GPT-4, Claude-3, and Llama-3 without agent scaffolding), (2) briefly define the two metrics according to the rubric used in Section 4.2 (rigorousness: adherence to physical constraints and logical consistency; practicality: feasibility of the proposed solution under real-world data and tool constraints), and (3) state that full task-level scores, statistical significance tests, and standard deviations across three runs appear in Tables 2–4 and Figure 3. These additions will make the 40.21% aggregate claim traceable without exceeding typical abstract length. revision: yes
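
The arithmetic behind the 40.21% headline is simple once rubric scores exist; what matters is the variance reporting the rebuttal promises. A minimal sketch, with stand-in scores on an assumed shared rubric scale (not the paper's data), of the aggregate improvement and across-run spread:

    # Hypothetical aggregation; the score arrays are stand-ins, not results
    # from the paper. Shape is (runs, tasks).
    import numpy as np

    def relative_improvement(agent: np.ndarray, baseline: np.ndarray) -> float:
        """Percent improvement of mean agent score over mean baseline score."""
        return 100.0 * (agent.mean() - baseline.mean()) / baseline.mean()

    agent_runs = np.array([[4.1, 3.8, 4.5], [4.0, 3.9, 4.4], [4.2, 3.7, 4.6]])
    base_runs = np.array([[2.9, 2.8, 3.2], [3.0, 2.7, 3.1], [2.8, 2.9, 3.3]])

    per_run = [relative_improvement(a, b) for a, b in zip(agent_runs, base_runs)]
    print(f"improvement: {np.mean(per_run):.2f}% +/- {np.std(per_run):.2f}% over 3 runs")

Reporting the per-run spread alongside the aggregate is what would let a reader judge whether a 40.21% figure is stable or an artifact of a favorable run.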

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents ClimAgent as an empirical agent framework and ClimaBench as a new benchmark for climate tasks, with the central result being an observed performance improvement (40.21%) reported from experiments. No derivation chain, equations, first-principles predictions, or self-referential definitions exist that could reduce outputs to inputs by construction. The work is self-contained as an applied empirical contribution without load-bearing self-citations or fitted parameters renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of LLM tool integration and reasoning protocols for climate tasks plus the representativeness of the new benchmark; no free parameters, mathematical axioms, or new physical entities are specified in the abstract.

invented entities (2)
  • ClimAgent (no independent evidence)
    purpose: Autonomous LLM-based agent framework for executing climate research tasks
    New system introduced to perform end-to-end modeling and analysis.
  • ClimaBench (no independent evidence)
    purpose: Comprehensive benchmark spanning five task categories drawn from real climate scenarios, 2000–2025
    Proposed as the first benchmark for evaluating professional-level climate discovery.

pith-pipeline@v0.9.0 · 5525 in / 1309 out tokens · 71852 ms · 2026-05-10T07:27:18.916640+00:00 · methodology

