
arxiv: 2606.17246 · v1 · pith:26SRDB7Rnew · submitted 2026-06-15 · 💻 cs.CV · cs.MA

GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

Pith reviewed 2026-06-27 03:45 UTC · model grok-4.3

classification 💻 cs.CV cs.MA
keywords GeoDisaster benchmark · remote sensing vision-language models · multi-agent orchestration · disaster geo-intelligence · geospatial tool use · Role-Contract Expectation Alignment · SAR and optical imagery · operational decision generation

The pith

The GeoDisaster benchmark demands tool-grounded spatial reasoning on disaster tasks, which current RS-VLMs cannot deliver, while RCEA alignment improves agent tool use and consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoDisaster as a benchmark of 2,921 instances spanning five disaster task families that demand integration of optical imagery, SAR data, vector layers, and road networks into executable decisions. It claims existing remote-sensing vision-language models and agent systems fall short on operational geo-intelligence because they lack structured tool access and evidence-backed outputs. Ground truth is generated from deterministic geospatial workflows rather than language-model judgments. The authors introduce an orchestrated multi-agent system with 18 tools and Role-Contract Expectation Alignment to coordinate specialized agents via explicit contracts, supervised fine-tuning, and reinforcement learning. Experiments indicate the benchmark exposes limitations and that RCEA yields gains in tool selection, evidence grounding, state tracking, and final decisions.

Core claim

GeoDisaster supplies 2,921 verified instances across 43 question types that integrate heterogeneous EO/GIS evidence and require hazard detection, damage assessment, exposure estimation, and report generation; ground-truth labels derive directly from executable geospatial workflows and consistency checks. The Role-Contract Expectation Alignment method aligns role-specialized agents through failure-aware supervised fine-tuning and contract-grounded reinforcement learning over dense step-level signals, producing measurable improvements in tool use, evidence grounding, state consistency, and decision generation over prior RS-VLMs and agentic baselines.

What carries the argument

Role-Contract Expectation Alignment (RCEA), a training procedure that combines failure-aware supervised fine-tuning with contract-grounded reinforcement learning to enforce explicit execution contracts among 18 disaster-oriented tools coordinated by role-specialized agents.
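
To make the idea of an explicit execution contract concrete, here is a minimal sketch. The three fields (task, evidence, stop criteria) mirror what appears in the supplementary traces; the class names `Contract` and `AgentState`, the `stop_artifacts` list, and the two helper functions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical execution contract sent from the orchestrator to one agent."""
    agent: str                  # role name, e.g. "GA", "VRA", "PA"
    task: str                   # what the agent is asked to do
    evidence: list[str]         # inputs the agent is allowed to rely on
    stop_artifacts: list[str]   # artifact handles that must exist before stopping


@dataclass
class AgentState:
    """Artifacts produced so far, keyed by handle (e.g. "work_1")."""
    artifacts: dict[str, str] = field(default_factory=dict)


def missing_artifacts(contract: Contract, state: AgentState) -> list[str]:
    """List the stop-criteria artifacts that have not been produced yet."""
    return [name for name in contract.stop_artifacts if name not in state.artifacts]


def contract_satisfied(contract: Contract, state: AgentState) -> bool:
    """A contract counts as satisfied once nothing in the stop criteria is missing."""
    return not missing_artifacts(contract, state)
```

Under this assumed schema, a step-level training signal could be as simple as whether `contract_satisfied` flips to true after an agent's tool calls; the paper's actual reward design is not described on this page.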

If this is right

  • Existing RS-VLMs and agentic systems will continue to underperform on tasks that require chaining heterogeneous geospatial tools and producing evidence-backed outputs.
  • Benchmarks grounded in executable workflows eliminate reliance on language-model-generated labels for disaster-related spatial reasoning.
  • Multi-agent coordination via explicit contracts and step-level reinforcement signals improves reliability in state tracking and decision generation.
  • The five task families provide standardized evaluation for deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and SAR flood monitoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar contract-based alignment could be tested on non-disaster geospatial tasks such as infrastructure monitoring or agricultural yield estimation.
  • The benchmark design suggests that future agent systems may need native integration with GIS execution environments rather than text-only interfaces.
  • If RCEA gains hold, operational centers could adopt role-specialized agent teams for rapid disaster assessment instead of single monolithic models.

Load-bearing premise

Executable geospatial workflows and deterministic consistency checks can serve as complete, unbiased ground truth for operational disaster reasoning without missing subjective or contextual factors.
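
As an illustration of what a deterministic workflow plus consistency check can look like, the sketch below derives an area label from a binary raster mask and validates a reported value against it. The function names, the square-pixel assumption, and the tolerance are illustrative choices, not the paper's pipeline.

```python
def mask_area_ha(mask, pixel_size_m):
    """Area covered by the positive cells of a binary mask, in hectares.

    Assumes square pixels of side `pixel_size_m` metres (1 ha = 10,000 m^2).
    """
    positive = sum(1 for row in mask for cell in row if cell)
    return positive * pixel_size_m * pixel_size_m / 10_000.0


def area_is_consistent(mask, pixel_size_m, reported_ha, tol_ha=0.01):
    """Deterministic check: the reported area must match the mask-derived
    area within `tol_ha` and can never exceed the area of the whole scene."""
    total_cells = sum(len(row) for row in mask)
    scene_ha = total_cells * pixel_size_m * pixel_size_m / 10_000.0
    derived_ha = mask_area_ha(mask, pixel_size_m)
    return derived_ha <= scene_ha and abs(derived_ha - reported_ha) <= tol_ha


# Toy 3x3 mask at 500 m resolution: each pixel covers 25 ha.
toy_mask = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
toy_area = mask_area_ha(toy_mask, 500)
```

The same inputs always yield the same label, which is the property the premise relies on; what such a check cannot capture is whether the mask itself reflects conditions on the ground.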

What would settle it

Run the 18-tool RCEA agents and the baseline systems on the full 2,921 instances and test whether RCEA achieves statistically significantly higher accuracy in tool selection, evidence citation, state consistency, or final decision correctness; the absence of such gains would falsify the improvement claim.
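
One way to run such a comparison on paired per-instance outcomes is an exact McNemar test, sketched below. The choice of test and the function name `mcnemar_exact` are assumptions made for illustration; the page does not say which significance test, if any, the authors use.

```python
from math import comb


def mcnemar_exact(baseline_correct, rcea_correct):
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts instances only the baseline
    solved and c counts instances only the RCEA system solved.
    """
    if len(baseline_correct) != len(rcea_correct):
        raise ValueError("both systems must be scored on the same instances")
    b = sum(1 for x, y in zip(baseline_correct, rcea_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, rcea_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, nothing to test
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: 10 instances, the baseline solves 2, the other system solves all 10.
toy_baseline = [True, True] + [False] * 8
toy_rcea = [True] * 10
toy_b, toy_c, toy_p = mcnemar_exact(toy_baseline, toy_rcea)
```

Because the test only looks at discordant pairs, it would be applied separately to each metric (tool selection, evidence citation, state consistency, final decision), ideally with a correction for multiple comparisons.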

Figures

Figures reproduced from arXiv: 2606.17246 by Aman Verma, Biplab Banerjee, Daksh Jain, Hariseetharam Gunduboina, Maram Hasan, Muhammad Haris Khan, Savitra Roy, Subhasis Chaudhuri.

Figure 1: Overview of the GeoDisaster pipeline. Public EO/GIS sources are ingested and standardized …
Figure 2: Examples from GeoDisaster task families. Left: A flood-safe routing example in Sweden, where satellite context and flood-routing evidence are used to generate a route overlay and compare fastest, safest, and balanced routes. Right: A multi-hazard NO2 exposure example in Los Angeles, where satellite context and exposure artifacts support population-exposure estimation. The traces illustrate representative c…
Original abstract

Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces GeoDisaster, a benchmark with 2,921 verified instances across 43 question types and five task families (deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, Sentinel-1 SAR flood monitoring) for operational geospatial disaster reasoning. Ground truth is derived from executable geospatial workflows and deterministic consistency checks rather than LM annotation. It proposes an orchestrated multi-agent system with 18 disaster-oriented tools coordinated via explicit execution contracts, trained with Role-Contract Expectation Alignment (RCEA) that combines failure-aware supervised fine-tuning and contract-grounded reinforcement learning. The central claim is that the benchmark challenges existing RS-VLMs and agentic systems while RCEA yields improvements in tool use, evidence grounding, state consistency, and decision generation.

Significance. If the experimental results hold, the benchmark's construction from executable workflows and deterministic checks represents a methodological strength that could reduce annotation bias in geo-intelligence evaluation. The RCEA regime for multi-agent coordination addresses a practical gap in deploying agents for structured, evidence-backed decisions on heterogeneous EO/GIS data. This could support more reliable operational systems if the claimed gains are reproducible.

major comments (1)
  1. [Abstract] The claim that 'Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation' is unsupported by any quantitative results, baseline comparisons, ablation studies, metrics, or error analysis. This absence is load-bearing for the central claim, as the data-to-claim link cannot be evaluated from the manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for identifying this critical issue with the abstract. We address the comment point-by-point below.

  1. Referee: [Abstract] The claim that 'Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation' is unsupported by any quantitative results, baseline comparisons, ablation studies, metrics, or error analysis. This absence is load-bearing for the central claim, as the data-to-claim link cannot be evaluated from the manuscript.

    Authors: We agree that the abstract currently asserts experimental outcomes without including, referencing, or summarizing any supporting quantitative evidence, baselines, metrics, or analysis from the manuscript. This renders the claim unevaluable as written. In the revised manuscript we will either (a) add a concise summary of key results (e.g., tool-use accuracy, consistency scores, and comparative deltas versus baselines) directly into the abstract or (b) rephrase the final sentence to describe the benchmark and RCEA framework without asserting unsupported performance gains. The Experiments section will remain the primary location for all quantitative details, ablations, and error analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified


The paper presents an empirical benchmark (GeoDisaster) and an agent framework (RCEA) whose central claims rest on experimental results and deterministic ground-truth construction from executable geospatial workflows, not on any derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing premises. No equations, uniqueness theorems, or ansatzes appear in the provided text, and the ground-truth mechanism is explicitly designed to be independent of language-model annotation. The absence of any load-bearing reduction to inputs by construction makes the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, parameters, or modeling assumptions are stated, so the ledger cannot be populated beyond the high-level claim that ground-truth derives from executable workflows.

pith-pipeline@v0.9.1-grok · 5788 in / 1194 out tokens · 65302 ms · 2026-06-27T03:45:59.641831+00:00 · methodology



SUPPLEMENTARY MATERIAL

S1 Dataset Details and Statistics

GeoDisaster contains 2,921 verified instances across five disaster task families and 43 question types. Beyond dataset size, its ...

  1. ORC plan: {"goal":"xBD summary", "steps":[{"agent":"GA", "subtask":"prepare evidence"}, {"agent":"VRA", "subtask":"assess damage"}, {"agent":"PA", "subtask":"compute statistics"}]}
  2. ORC→GA: CallAgent(agent=GA, contract="{task: Prepare xBD evidence, evidence: [post-disaster RGB, mask], stop criteria: work_1 prepared with polygons, severe mask, and quadrant grid}")
  3. GA tool: ReadData(inputs=image_1); PrepareScene(mode=xbd_scene_summary, grid=2x2)
  4. GA tool output: scene_1; 271 polygons and 288k severe pixels loaded. work_1 prepared with 2×2 quadrant grid
  5. ORC→VRA: CallAgent(agent=VRA, contract="{task: Assess scene-level building damage, evidence: [work_1, building polygons, post-disaster RGB, target mask], stop criteria: damage_1 produced with per-building and scene-level damage class}")
  6. VRA tool: AssessDamage(work=work_1, aggregation=scene)
  7. VRA tool output: damage_1; 271/271 buildings destroyed, corresponding to 100% severe damage
  8. ORC→PA: CallAgent(agent=PA, contract="{task: Compute area, mask validation, and spatial concentration, evidence: [work_1, damage_1, severe mask, quadrant grid], stop criteria: stats_1 produced with area, dispersion score, and worst quadrant}")
  9. PA tool: ComputeStats(work=work_1, metrics=[area, mask_validation, dispersion, worst_quadrant])
  10. PA tool output: stats_1; severe area = 954,478 m²; dispersion score = 0.34; worst quadrant = Q0. Scene is spatially dispersed with complete severe-damage dominance
  11. ORC: Terminate: {"destroyed":271, "severe_pct":100, "severe_m2":954478, "dispersion":"dispersed", "worst_Q":"Q0"}

Figure S7: Example GeoDisaster for scene-level building-damage assessment. The task combines pre/post imagery, damage polygons, quadrant overlays, and role-specialized tool calls to produce a structured damage summary with counts, severe-area es...
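
The building-damage trace ends with a structured payload that restates numbers already produced by tool outputs, so its internal consistency can be checked mechanically. The sketch below is one hypothetical way to do that; the function `check_damage_summary` and its tolerance are assumptions, while the field names and values are copied from the trace.

```python
import json


def check_damage_summary(terminate_json, destroyed, total, severe_m2, pct_tol=0.5):
    """Compare a Terminate payload with values reported by earlier tool outputs.

    Returns a list of human-readable mismatches (empty when consistent).
    """
    payload = json.loads(terminate_json)
    errors = []
    if payload["destroyed"] != destroyed:
        errors.append(f"destroyed: payload {payload['destroyed']} vs tool {destroyed}")
    expected_pct = 100.0 * destroyed / total
    if abs(payload["severe_pct"] - expected_pct) > pct_tol:
        errors.append(f"severe_pct: payload {payload['severe_pct']} vs derived {expected_pct:.1f}")
    if payload["severe_m2"] != severe_m2:
        errors.append(f"severe_m2: payload {payload['severe_m2']} vs tool {severe_m2}")
    return errors


# Values taken from the trace: 271/271 buildings destroyed, severe area 954,478 m^2.
trace_payload = (
    '{"destroyed":271, "severe_pct":100, "severe_m2":954478, '
    '"dispersion":"dispersed", "worst_Q":"Q0"}'
)
trace_errors = check_damage_summary(trace_payload, destroyed=271, total=271, severe_m2=954478)
```

The qualitative `dispersion` label is left unchecked here because the threshold that maps the 0.34 dispersion score to "dispersed" is not given in the excerpt.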

  1. ORC plan: {"goal":"estimate burned area", "steps":[{"agent":"GA", "subtask":"prepare burned-area context"}, {"agent":"PA", "subtask":"measure footprint area"}, {"agent":"PA", "subtask":"validate agreement"}]}
  2. ORC→GA: CallAgent(agent=GA, contract="{task: Open burned-area product for AOI, evidence: [pre-fire image, post-fire image, burn product, AOI boundary], stop criteria: work_1 prepared with burn mask and scene context}")
  3. GA tool: LoadCachedRaster(aoi_id=dixie_fire_aoi, handle=mcd64a1_event); PrepareScene(inputs={burned_area:scene_1}, mode=burned_area_assessment)
  4. GA tool output: scene_1; MODIS MCD64A1 BurnDate loaded for 2021-07-01 to 2021-09-29; modality = burned_area_mask; scale = 500 m. work_1 prepared with binary burn mask
  5. ORC→PA: CallAgent(agent=PA, contract="{task: Compute burned footprint area, evidence: [work_1, burn mask, AOI boundary, post-fire overlay], stop criteria: burned-area estimate produced in hectares with source agreement}")
  6. PA tool: ComputeArea(work=work_1, mask=burned_area_mask, unit=ha); CompareAcrossSources(handles=[mcd64a1_event], metric=area_agreement)
  7. PA tool output: burned_area_1; one source used. MODIS MCD64A1 area = 17,874.54 ha; agreement class = high
  8. ORC: Terminate: {"value":17874.54, "units":"ha", "source":"MODIS MCD64A1", "evidence":"binary burn mask + AOI boundary"}
  9. ORC: Final Answer: Burned area = 17,874.54 ha, evidence: MODIS MCD64A1.

Figure S8: Example GeoDisaster for wildfire burned-area estimation. The task uses pre/post fire imagery and a burn-mask product to compute the burned footprint area, with role-specialized tool calls for evidence loading, area measurement, and final grounded reporting.

GeoDisaster LLM-a...