GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

Aman Verma; Biplab Banerjee; Daksh Jain; Hariseetharam Gunduboina; Maram Hasan; Muhammad Haris Khan; Savitra Roy; Subhasis Chaudhuri

arxiv: 2606.17246 · v1 · pith:26SRDB7Rnew · submitted 2026-06-15 · 💻 cs.CV · cs.MA


Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain, Muhammad Haris Khan, Subhasis Chaudhuri, Biplab Banerjee

Pith reviewed 2026-06-27 03:45 UTC · model grok-4.3

classification 💻 cs.CV cs.MA

keywords GeoDisaster benchmark; remote sensing vision-language models; multi-agent orchestration; disaster geo-intelligence; geospatial tool use; Role-Contract Expectation Alignment; SAR and optical imagery; operational decision generation


The pith

The GeoDisaster benchmark demands tool-grounded spatial reasoning on disaster tasks, which current RS-VLMs fail to deliver, while RCEA alignment improves agent tool use and consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoDisaster as a benchmark of 2,921 instances spanning five disaster task families that demand integration of optical imagery, SAR data, vector layers, and road networks into executable decisions. It claims existing remote-sensing vision-language models and agent systems fall short on operational geo-intelligence because they lack structured tool access and evidence-backed outputs. Ground truth is generated from deterministic geospatial workflows rather than language-model judgments. The authors introduce an orchestrated multi-agent system with 18 tools and Role-Contract Expectation Alignment to coordinate specialized agents via explicit contracts, supervised fine-tuning, and reinforcement learning. Experiments indicate the benchmark exposes limitations and that RCEA yields gains in tool selection, evidence grounding, state tracking, and final decisions.

Core claim

GeoDisaster supplies 2,921 verified instances across 43 question types that integrate heterogeneous EO/GIS evidence and require hazard detection, damage assessment, exposure estimation, and report generation; ground-truth labels derive directly from executable geospatial workflows and consistency checks. The Role-Contract Expectation Alignment method aligns role-specialized agents through failure-aware supervised fine-tuning and contract-grounded reinforcement learning over dense step-level signals, producing measurable improvements in tool use, evidence grounding, state consistency, and decision generation over prior RS-VLMs and agentic baselines.

What carries the argument

Role-Contract Expectation Alignment (RCEA), a training procedure that combines failure-aware supervised fine-tuning with contract-grounded reinforcement learning to enforce explicit execution contracts among 18 disaster-oriented tools coordinated by role-specialized agents.
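
To make the notion of an explicit execution contract concrete, here is a minimal Python sketch. The three ingredients (task, evidence, stop criteria) mirror the contract strings quoted in the paper's supplementary example; the `Contract` class, the `required_artifacts` field, and the set-membership check are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical role contract: what to do, with what, and when to stop."""
    agent: str                 # role receiving the contract, e.g. "GA"
    task: str                  # natural-language description of the subtask
    evidence: tuple            # names of inputs the agent is allowed to use
    required_artifacts: tuple  # artifacts that must exist before the agent stops


def missing_artifacts(contract, produced):
    """Return the unmet stop criteria; an empty list means the contract is met."""
    return [name for name in contract.required_artifacts if name not in produced]


def is_satisfied(contract, produced):
    """A contract counts as met only if every required artifact was produced."""
    return not missing_artifacts(contract, produced)


# Values taken from the xBD example; the artifact-name matching is assumed.
ga_contract = Contract(
    agent="GA",
    task="Prepare xBD evidence",
    evidence=("post-disaster RGB", "mask"),
    required_artifacts=("work_1",),
)

print(missing_artifacts(ga_contract, {"scene_1"}))
print(is_satisfied(ga_contract, {"scene_1", "work_1"}))
```

A per-step check of this kind is one plausible source for the dense step-level signals the paper mentions, since each unmet stop criterion can be reported as soon as an agent returns.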

If this is right

Existing RS-VLMs and agentic systems will continue to underperform on tasks that require chaining heterogeneous geospatial tools and producing evidence-backed outputs.
Benchmarks grounded in executable workflows eliminate reliance on language-model-generated labels for disaster-related spatial reasoning.
Multi-agent coordination via explicit contracts and step-level reinforcement signals improves reliability in state tracking and decision generation.
The five task families provide standardized evaluation for deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and SAR flood monitoring.
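
As a rough illustration of what a workflow-grounded label might look like, the sketch below derives an area answer directly from a binary raster mask and grades a prediction against it. The function names, the toy mask, and the 5% relative tolerance are all assumptions made for this example; the paper's actual workflows and scoring rules are not specified in the text available here.

```python
def mask_area_ha(mask, pixel_size_m):
    """Area of nonzero cells in a binary mask, in hectares (1 ha = 10,000 m^2)."""
    n_cells = sum(1 for row in mask for value in row if value)
    return n_cells * pixel_size_m * pixel_size_m / 10_000.0


def grade(predicted_ha, truth_ha, rel_tol=0.05):
    """Accept a prediction within a relative tolerance (the tolerance is assumed)."""
    return abs(predicted_ha - truth_ha) <= rel_tol * truth_ha


# Toy 3x3 mask with four flagged cells at an assumed 500 m pixel size.
toy_mask = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]

truth = mask_area_ha(toy_mask, pixel_size_m=500.0)
print(truth)                # 100.0
print(grade(103.0, truth))  # True
print(grade(110.0, truth))  # False
```

Because the label is a pure function of the mask and the pixel size, rerunning the workflow reproduces it exactly, which is the property that lets such a benchmark avoid language-model annotation.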

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar contract-based alignment could be tested on non-disaster geospatial tasks such as infrastructure monitoring or agricultural yield estimation.
The benchmark design suggests that future agent systems may need native integration with GIS execution environments rather than text-only interfaces.
If RCEA gains hold, operational centers could adopt role-specialized agent teams for rapid disaster assessment instead of single monolithic models.

Load-bearing premise

Executable geospatial workflows and deterministic consistency checks can serve as complete, unbiased ground truth for operational disaster reasoning without missing subjective or contextual factors.

What would settle it

Run the 18-tool RCEA agents and baseline systems on the full 2,921 instances and measure whether RCEA produces statistically higher accuracy in tool selection, evidence citation, state consistency, or final decision correctness; absence of such gains would falsify the improvement claim.
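
One way to run such a comparison, assuming per-instance correctness flags are available for both systems on the same instances, is an exact McNemar (paired sign) test. The sketch below uses only the standard library; the choice of test and the toy data are assumptions of this example, not something the paper specifies.

```python
from math import comb


def mcnemar_exact(baseline_correct, rcea_correct):
    """Two-sided exact McNemar p-value from paired per-instance correctness flags."""
    pairs = list(zip(baseline_correct, rcea_correct))
    rcea_only = sum(1 for b, r in pairs if r and not b)      # RCEA right, baseline wrong
    baseline_only = sum(1 for b, r in pairs if b and not r)  # baseline right, RCEA wrong
    n = rcea_only + baseline_only                            # discordant pairs only
    if n == 0:
        return 1.0
    k = min(rcea_only, baseline_only)
    # Probability of a split at least this lopsided under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up flags for ten instances, for illustration only.
baseline = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
rcea = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(mcnemar_exact(baseline, rcea))      # 0.0078125
print(mcnemar_exact(baseline, baseline))  # 1.0
```

The test ignores instances where both systems agree, so it measures exactly what the claim needs: whether RCEA fixes more cases than it breaks.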

Figures

Figures reproduced from arXiv: 2606.17246 by Aman Verma, Biplab Banerjee, Daksh Jain, Hariseetharam Gunduboina, Maram Hasan, Muhammad Haris Khan, Savitra Roy, Subhasis Chaudhuri.

**Figure 2.** Examples from GeoDisaster task families. Left: A flood-safe routing example in Sweden, where satellite context and flood-routing evidence are used to generate a route overlay and compare fastest, safest, and balanced routes. Right: A multi-hazard NO2 exposure example in Los Angeles, where satellite context and exposure artifacts support population-exposure estimation. The traces illustrate representative c…

Abstract

Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GeoDisaster gives a concrete benchmark with workflow-grounded labels plus a contract-based agent training method, but the abstract leaves the size of the gains unclear.


The main things here are a new benchmark of 2,921 instances across five disaster-related geospatial task families and a training procedure called RCEA that combines failure-aware fine-tuning with contract-grounded reinforcement learning.

The benchmark stands out because its answers come from executable geospatial workflows and deterministic checks rather than language model judgments. That choice makes the evaluation more reproducible and removes a common source of circularity. The tasks mix optical and SAR imagery with vector data for things like flood routing and building damage, which matches real operational needs better than many existing vision-language benchmarks. The 18-tool multi-agent setup with explicit role contracts is a straightforward way to organize tool use and state tracking.

The paper does a reasonable job showing why current RS-VLMs struggle with evidence grounding and consistent decisions in this domain. The deterministic ground truth is a genuine strength.

The soft spot is the lack of any numbers. The abstract says RCEA improves tool use and consistency, yet supplies no baseline scores, ablation results, or error breakdowns. Without those, it is hard to judge whether the benchmark is actually harder than prior agent tests or how much the training adds. If the full paper has solid tables and comparisons, that changes the picture.

This is for people working on tool-using agents in remote sensing or crisis applications. A reader already building geospatial agents would get value from the task definitions and the contract idea.

It deserves peer review. The evaluation approach is reproducible enough and the domain is applied enough that referees can give useful feedback on the experimental design and comparisons.

Referee Report

1 major / 0 minor

Summary. The paper introduces GeoDisaster, a benchmark with 2,921 verified instances across 43 question types and five task families (deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, Sentinel-1 SAR flood monitoring) for operational geospatial disaster reasoning. Ground truth is derived from executable geospatial workflows and deterministic consistency checks rather than LM annotation. It proposes an orchestrated multi-agent system with 18 disaster-oriented tools coordinated via explicit execution contracts, trained with Role-Contract Expectation Alignment (RCEA) that combines failure-aware supervised fine-tuning and contract-grounded reinforcement learning. The central claim is that the benchmark challenges existing RS-VLMs and agentic systems while RCEA yields improvements in tool use, evidence grounding, state consistency, and decision generation.

Significance. If the experimental results hold, the benchmark's construction from executable workflows and deterministic checks represents a methodological strength that could reduce annotation bias in geo-intelligence evaluation. The RCEA regime for multi-agent coordination addresses a practical gap in deploying agents for structured, evidence-backed decisions on heterogeneous EO/GIS data. This could support more reliable operational systems if the claimed gains are reproducible.

major comments (1)

[Abstract] The claim that 'Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation' is unsupported by any quantitative results, baseline comparisons, ablation studies, metrics, or error analysis. This absence is load-bearing for the central claim, as the data-to-claim link cannot be evaluated from the manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for identifying this critical issue with the abstract. We address the comment point-by-point below.


Referee: [Abstract] The claim that 'Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation' is unsupported by any quantitative results, baseline comparisons, ablation studies, metrics, or error analysis. This absence is load-bearing for the central claim, as the data-to-claim link cannot be evaluated from the manuscript.

Authors: We agree that the abstract currently asserts experimental outcomes without including, referencing, or summarizing any supporting quantitative evidence, baselines, metrics, or analysis from the manuscript. This renders the claim unevaluable as written. In the revised manuscript we will either (a) add a concise summary of key results (e.g., tool-use accuracy, consistency scores, and comparative deltas versus baselines) directly into the abstract or (b) rephrase the final sentence to describe the benchmark and RCEA framework without asserting unsupported performance gains. The Experiments section will remain the primary location for all quantitative details, ablations, and error analysis. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity identified


The paper presents an empirical benchmark (GeoDisaster) and an agent framework (RCEA) whose central claims rest on experimental results and deterministic ground-truth construction from executable geospatial workflows, not on any derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing premises. No equations, uniqueness theorems, or ansatzes appear in the provided text, and the ground-truth mechanism is explicitly designed to be independent of language-model annotation. The absence of any load-bearing reduction to inputs by construction makes the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, parameters, or modeling assumptions are stated, so the ledger cannot be populated beyond the high-level claim that ground-truth derives from executable workflows.

pith-pipeline@v0.9.1-grok · 5788 in / 1194 out tokens · 65302 ms · 2026-06-27T03:45:59.641831+00:00 · methodology


Supplementary Material

S1 Dataset Details and Statistics

GeoDisaster contains 2,921 verified instances across five disaster task families and 43 question types. Beyond dataset size, its ...

Example trace: scene-level building-damage assessment (Figure S7)

ORC plan: {"goal":"xBD summary", "steps":[{"agent":"GA", "subtask":"prepare evidence"}, {"agent":"VRA", "subtask":"assess damage"}, {"agent":"PA", "subtask":"compute statistics"}]}
ORC→GA: CallAgent(agent=GA, contract="{task: Prepare xBD evidence, evidence: [post-disaster RGB, mask], stop criteria: work_1 prepared with polygons, severe mask, and quadrant grid}")
GA tool: ReadData(inputs=image_1); PrepareScene(mode=xbd_scene_summary, grid=2x2)
GA tool output: scene_1; 271 polygons and 288k severe pixels loaded. work_1 prepared with 2×2 quadrant grid
ORC→VRA: CallAgent(agent=VRA, contract="{task: Assess scene-level building damage, evidence: [work_1, building polygons, post-disaster RGB, target mask], stop criteria: damage_1 produced with per-building and scene-level damage class}")
VRA tool: AssessDamage(work=work_1, aggregation=scene)
VRA tool output: damage_1; 271/271 buildings destroyed, corresponding to 100% severe damage
ORC→PA: CallAgent(agent=PA, contract="{task: Compute area, mask validation, and spatial concentration, evidence: [work_1, damage_1, severe mask, quadrant grid], stop criteria: stats_1 produced with area, dispersion score, and worst quadrant}")
PA tool: ComputeStats(work=work_1, metrics=[area, mask_validation, dispersion, worst_quadrant])
PA tool output: stats_1; severe area = 954,478 m²; dispersion score = 0.34; worst quadrant = Q0. Scene is spatially dispersed with complete severe-damage dominance
ORC: Terminate: {"destroyed":271, "severe_pct":100, "severe_m2":954478, "dispersion":"dispersed", "worst_Q":"Q0"}

Figure S7: Example GeoDisaster for scene-level building-damage assessment. The task combines pre/post imagery, damage polygons, quadrant overlays, and role-specialized tool calls to produce a structured damage summary with counts, severe-area es...

Example trace: wildfire burned-area estimation (Figure S8)

ORC plan: {"goal":"estimate burned area", "steps":[{"agent":"GA", "subtask":"prepare burned-area context"}, {"agent":"PA", "subtask":"measure footprint area"}, {"agent":"PA", "subtask":"validate agreement"}]}
ORC→GA: CallAgent(agent=GA, contract="{task: Open burned-area product for AOI, evidence: [pre-fire image, post-fire image, burn product, AOI boundary], stop criteria: work_1 prepared with burn mask and scene context}")
GA tool: LoadCachedRaster(aoi_id=dixie_fire_aoi, handle=mcd64a1_event); PrepareScene(inputs={burned_area:scene_1}, mode=burned_area_assessment)
GA tool output: scene_1; MODIS MCD64A1 BurnDate loaded for 2021-07-01 to 2021-09-29; modality = burned_area_mask; scale = 500m. work_1 prepared with binary burn mask
ORC→PA: CallAgent(agent=PA, contract="{task: Compute burned footprint area, evidence: [work_1, burn mask, AOI boundary, post-fire overlay], stop criteria: burned-area estimate produced in hectares with source agreement}")
PA tool: ComputeArea(work=work_1, mask=burned_area_mask, unit=ha); CompareAcrossSources(handles=[mcd64a1_event], metric=area_agreement)
PA tool output: burned_area_1; one source used. MODIS MCD64A1 area = 17,874.54 ha; agreement class = high
ORC: Terminate: {"value":17874.54, "units":"ha", "source":"MODIS MCD64A1", "evidence":"binary burn mask + AOI boundary"}
ORC: Final Answer: Burned area = 17,874.54 ha, evidence: MODIS MCD64A1.

Figure S8: Example GeoDisaster for wildfire burned-area estimation. The task uses pre/post fire imagery and a burn-mask product to compute the burned footprint area, with role-specialized tool calls for evidence loading, area measurement, and final grounded reporting.


SUPPLEMENTARY MATERIAL

S1 Dataset Details and Statistics

GeoDisaster contains 2,921 verified instances across five disaster task families and 43 question types. Beyond dataset size, its ...

ORC plan: {"goal":"xBD summary", "steps":[{"agent":"GA", "subtask":"prepare evidence"}, {"agent":"VRA", "subtask":"assess damage"}, {"agent":"PA", "subtask":"compute statistics"}]}

ORC→GA: CallAgent(agent=GA, contract="{task: Prepare xBD evidence, evidence: [post-disaster RGB, mask], stop criteria: work_1 prepared with polygons, severe mask, and quadrant grid}")

GA tool: ReadData(inputs=image_1); PrepareScene(mode=xbd_scene_summary, grid=2x2)

GA Tool Output: scene_1; 271 polygons and 288k severe pixels loaded. work_1 prepared with 2×2 quadrant grid

ORC→VRA: CallAgent(agent=VRA, contract="{task: Assess scene-level building damage, evidence: [work_1, building polygons, post-disaster RGB, target mask], stop criteria: damage_1 produced with per-building and scene-level damage class}")

VRA tool: AssessDamage(work=work_1, aggregation=scene)

VRA Tool Output: damage_1; 271/271 buildings destroyed, corresponding to 100% severe damage

ORC→PA: CallAgent(agent=PA, contract="{task: Compute area, mask validation, and spatial concentration, evidence: [work_1, damage_1, severe mask, quadrant grid], stop criteria: stats_1 produced with area, dispersion score, and worst quadrant}")

PA tool: ComputeStats(work=work_1, metrics=[area, mask_validation, dispersion, worst_quadrant])

PA Tool Output: stats_1; severe area = 954,478 m²; dispersion score = 0.34; worst quadrant = Q0. Scene is spatially dispersed with complete severe-damage dominance

ORC: Terminate: {"destroyed":271, "severe_pct":100, "severe_m2":954478, "dispersion":"dispersed", "worst_Q":"Q0"}

Figure S7: Example GeoDisaster for scene-level building-damage assessment. The task combines pre/post imagery, damage polygons, quadrant overlays, and role-specialized tool calls to produce a structured damage summary with counts, severe-area es...
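The contracts passed in the Figure S7 trace share one shape: a task, a list of evidence items, and a stop criterion. A minimal Python sketch of that structure follows. The class `RoleContract`, the helper `call_agent`, and the rule that a contract must name at least one evidence item are hypothetical and introduced only for illustration; the paper's own implementation is not shown in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoleContract:
    """Hypothetical container for the three contract fields seen in the trace."""
    task: str
    evidence: List[str]
    stop_criteria: str

    def render(self) -> str:
        # Serialize in the same textual layout the trace uses.
        return "{task: %s, evidence: [%s], stop criteria: %s}" % (
            self.task,
            ", ".join(self.evidence),
            self.stop_criteria,
        )


def call_agent(agent: str, contract: RoleContract) -> str:
    """Build the orchestrator's delegation message for one role agent."""
    # Assumption: a contract without evidence is rejected before dispatch.
    if not contract.evidence:
        raise ValueError("contract must name at least one evidence item")
    return 'CallAgent(agent=%s, contract="%s")' % (agent, contract.render())


# Rebuild the ORC→GA message from the Figure S7 trace.
ga_contract = RoleContract(
    task="Prepare xBD evidence",
    evidence=["post-disaster RGB", "mask"],
    stop_criteria="work_1 prepared with polygons, severe mask, and quadrant grid",
)
message = call_agent("GA", ga_contract)
print(message)
```

Keeping the stop criterion as an explicit field gives the orchestrator something concrete to check before moving on to the next role.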

ORC plan: {"goal":"estimate burned area", "steps":[{"agent":"GA", "subtask":"prepare burned-area context"}, {"agent":"PA", "subtask":"measure footprint area"}, {"agent":"PA", "subtask":"validate agreement"}]}

ORC→GA: CallAgent(agent=GA, contract="{task: Open burned-area product for AOI, evidence: [pre-fire image, post-fire image, burn product, AOI boundary], stop criteria: work_1 prepared with burn mask and scene context}")

GA tool: LoadCachedRaster(aoi_id=dixie_fire_aoi, handle=mcd64a1_event); PrepareScene(inputs={burned_area:scene_1}, mode=burned_area_assessment)

GA Tool Output: scene_1; MODIS MCD64A1 BurnDate loaded for 2021-07-01 to 2021-09-29; modality = burned_area_mask; scale = 500 m. work_1 prepared with binary burn mask

ORC→PA: CallAgent(agent=PA, contract="{task: Compute burned footprint area, evidence: [work_1, burn mask, AOI boundary, post-fire overlay], stop criteria: burned-area estimate produced in hectares with source agreement}")

PA tool: ComputeArea(work=work_1, mask=burned_area_mask, unit=ha); CompareAcrossSources(handles=[mcd64a1_event], metric=area_agreement)

PA Tool Output: burned_area_1; one source used. MODIS MCD64A1 area = 17,874.54 ha; agreement class = high

ORC: Terminate: {"value":17874.54, "units":"ha", "source":"MODIS MCD64A1", "evidence":"binary burn mask + AOI boundary"}

ORC: Final Answer: Burned area = 17,874.54 ha, evidence: MODIS MCD64A1.

Figure S8: Example GeoDisaster for wildfire burned-area estimation. The task uses pre/post fire imagery and a burn-mask product to compute the burned footprint area, with role-specialized tool calls for evidence loading, area measurement, and final grounded reporting.

GeoDisaster LLM-a...
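The area step in the Figure S8 trace turns a binary burn mask into a figure in hectares. The sketch below shows the simplest version of that conversion, namely counting burned pixels and multiplying by a constant pixel area. The function `burned_area_ha`, the toy mask, and the constant-pixel-area assumption are all illustrative and are not the benchmark's `ComputeArea` tool. At a nominal 500 m pixel each burned cell contributes 25 ha, and the reported 17,874.54 ha is not a whole multiple of 25 ha, so the actual tool presumably does something more careful, such as projection-aware pixel areas or clipping at the AOI boundary. That reading is a guess.

```python
from typing import Dict, Sequence, Union

M2_PER_HA = 10_000.0  # 1 hectare = 10,000 square metres


def burned_area_ha(mask: Sequence[Sequence[int]], pixel_size_m: float) -> float:
    """Area of the non-zero cells of a binary mask, in hectares.

    Assumes square pixels of constant size, which is only an approximation
    for real raster products.
    """
    burned_pixels = sum(1 for row in mask for value in row if value)
    return burned_pixels * pixel_size_m ** 2 / M2_PER_HA


def to_report(area_ha: float, source: str) -> Dict[str, Union[float, str]]:
    """Package the estimate with the keys seen in the Terminate payload."""
    return {"value": round(area_ha, 2), "units": "ha", "source": source}


# Toy 3x3 mask with three burned cells (not data from the benchmark).
toy_mask = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
area = burned_area_ha(toy_mask, pixel_size_m=500.0)
report = to_report(area, source="toy mask")
print(report)
```

Reporting the unit and source next to the value mirrors the structured output in the trace, which keeps the final answer traceable to its evidence.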