pith. machine review for the scientific record.

arxiv: 2605.11633 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

Heli Qi, Hongruixuan Chen, Junjue Wang, Junshi Xia, Kunyi Liu, Naoto Yokoya, Pengyu Dai, Stefano Ermon, Weihao Xuan, Zhuo Zheng

Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · disaster response · geospatial reasoning · agent benchmark · tool use · emergency operations · multi-modal data · spatial planning

The pith

LLM agents for disaster response are doubly bottlenecked by tool selection and argument grounding, and their accuracy drops sharply on longer compositional pipelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the DORA benchmark, which supplies 515 expert-written tasks drawn from 45 real disasters together with expert-verified gold trajectories totaling 3,500 tool-call steps that exercise a 108-tool library over optical, SAR, elevation, and vector data. Evaluation of 13 frontier models shows that agents repeatedly mis-select tools or ground their arguments incorrectly, and that supplying the correct tool order improves accuracy by only 1.08-4.40%. Accuracy also declines sharply as the number of required tool calls grows, exposing a basic inability to maintain coherent multi-step geospatial reasoning through an entire operational workflow. These results matter because responders need agents that can fuse multi-sensor inputs, plan evacuations, and generate reports without constant human intervention.

Core claim

The DORA benchmark demonstrates that current LLM agents cannot reliably execute end-to-end disaster-response pipelines: they exhibit domain-specific grounding errors in damage semantics and sensor modalities, remain bottlenecked by both tool selection and argument generation even when given gold tool sequences, and suffer widening gaps relative to expert trajectories as pipeline length increases from short to long sequences.

What carries the argument

The DORA benchmark, a collection of 515 tasks with expert-verified gold trajectories totaling 3,500 tool-call steps, which require agents to compose calls from a heterogeneous 108-tool geospatial library spanning multi-temporal imagery and vector layers.
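To make the artifact concrete, here is a minimal sketch of what one DORA-style task and its replayable gold trajectory could look like. The schema, field names, and example values are illustrative assumptions, not the paper's released format.

    from dataclasses import dataclass, field

    @dataclass
    class ToolCall:
        """One replayable step in a gold trajectory (hypothetical schema)."""
        tool: str   # a name drawn from the 108-tool library
        args: dict  # grounded arguments for that tool

    @dataclass
    class Task:
        """One of the 515 expert-authored tasks (hypothetical schema)."""
        task_id: str
        event: str        # one of the 45 real disaster events
        dimension: str    # one of the five operational dimensions
        query: str        # the expert-written natural-language task
        gold: list = field(default_factory=list)  # expert-verified ToolCalls

    # Illustrative example only, not taken from the benchmark release.
    task = Task(
        task_id="dora-0001",
        event="example-earthquake-2024",
        dimension="disaster perception",
        query="Map collapsed buildings in the AOI from pre/post SAR imagery.",
        gold=[
            ToolCall("load_sar_image", {"aoi": "aoi.geojson", "phase": "pre"}),
            ToolCall("load_sar_image", {"aoi": "aoi.geojson", "phase": "post"}),
            ToolCall("sar_change_detection", {"pre": "img_pre", "post": "img_post"}),
        ],
    )

An agent is then scored on whether its own tool calls reproduce the gold trajectory step by step.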

Load-bearing premise

That the 515 expert-authored tasks and their 3,500 gold trajectory steps accurately capture the full operational disaster-response pipeline without selection bias or oversimplification of real-world constraints.

What would settle it

Running the same 13 models on a fresh collection of live disaster events using the identical 108-tool library and measuring whether the observed tool-selection and length-dependent gaps remain or shrink.
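A sketch of how that check could be scored, assuming per-task records of gold-trajectory length and agent success. The bucket edges, tolerance, and toy numbers below are assumptions for illustration only.

    from collections import defaultdict

    def gap_by_length(results):
        """results: iterable of (gold_len, agent_solved) pairs.
        Gold trajectories are correct by construction, so the
        agent-to-gold gap per bucket is simply 1 - agent accuracy."""
        buckets = defaultdict(list)
        for gold_len, solved in results:
            # Bucket edges are illustrative; the paper's short/long split may differ.
            key = "short" if gold_len <= 3 else "medium" if gold_len <= 7 else "long"
            buckets[key].append(solved)
        return {k: 1.0 - sum(v) / len(v) for k, v in buckets.items()}

    # Toy numbers only, not the paper's data.
    original = gap_by_length([(2, True), (3, True), (2, False), (8, True), (9, False), (12, False)])
    fresh    = gap_by_length([(2, True), (3, False), (9, False), (11, False), (10, False), (2, True)])

    # The question the replication settles: does the long-pipeline gap
    # (7% -> 56% in the paper) persist on held-out live events?
    persists = fresh["long"] >= original["long"] - 0.05  # 0.05 tolerance is an assumption
    print(original, fresh, persists)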

Figures

Figures reproduced from arXiv: 2605.11633 by Heli Qi, Hongruixuan Chen, Junjue Wang, Junshi Xia, Kunyi Liu, Naoto Yokoya, Pengyu Dai, Stefano Ermon, Weihao Xuan, Zhuo Zheng.

Figure 1. Representative task examples in DORA across different disaster operational categories. Each example illustrates the …
Figure 2. DORA aggregates multi-modal data from 10 open-source databases into 45 disaster events distributed across five …
Figure 4. Distribution of 515 tasks across five …
Figure 5. Complexity and tool-usage profiles across DORA’s five dimensions. (a) Trajectory length grows from …
Figure 6. Our annotation pipeline: (1) experts author queries and symbolic trajectories; (2) a deterministic replay engine resolves …
Figure 7. Failure modes in disaster domain …
Figure 8. Agent performances (%) decomposed by input modality configuration.
Figure 9. Agents vs. gold trajectory: compositional …
Original abstract

Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DORA, the first agentic benchmark for end-to-end disaster response. It comprises 515 expert-authored tasks spanning 45 real-world events across 10 disaster types, paired with expert-verified replayable gold trajectories totaling 3,500 tool-call steps that invoke a 108-tool library over heterogeneous geospatial data (optical/SAR/multi-spectral imagery, elevation, and social vectors). The work evaluates 13 frontier LLMs across five operational dimensions (perception, spatial analysis, rescue planning, temporal reasoning, report synthesis) and reports three challenges: domain-specific grounding failures, double bottlenecks in tool selection and argument grounding (with gold-order hints yielding only 1.08-4.40% gains and scaffolds at most 3.24%), and compositional fragility that widens the agent-to-gold gap from 7% to 56% on longer trajectories.

Significance. If the tasks and trajectories are representative, DORA provides a valuable, large-scale testbed that integrates multi-modal perception with planning and reporting in high-stakes scenarios, going beyond prior isolated remote-sensing or generic tool-use evaluations. The explicit provision of replayable gold trajectories, real disaster events, and dimensionally structured tasks is a strength that enables reproducible diagnosis of failure modes such as sensor-modality mismatch and length-dependent composition errors. These empirical patterns could usefully inform agent design for operational reliability.

major comments (2)
  1. [§4] §4 (Benchmark Construction): The claim that the 515 tasks and their 3,500 gold trajectory steps 'faithfully sample the end-to-end disaster-response workflow' is load-bearing for all headline results on bottlenecks and fragility, yet the section supplies no inter-annotator agreement, coverage statistics across the 10 disaster types, or validation against after-action reports. Without these, it is impossible to assess selection bias toward well-structured, fully observable scenarios.
  2. [§5] §5 (Evaluation and Results): The reported tool-selection and argument-grounding bottlenecks rest on accuracy metrics whose exact definitions (e.g., partial credit for argument grounding, handling of multi-step trajectory replay) are not fully specified; this directly affects interpretation of the 1.08-4.40% hint gains and the 7%-to-56% gap widening.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2: axis labels and legend entries for the five task dimensions could be expanded with one-sentence operational definitions to improve readability for readers outside geospatial AI.
  2. [§2] §2 (Related Work): The comparison to prior benchmarks would benefit from a brief quantitative contrast (e.g., number of tools or trajectory length) rather than qualitative description alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and positive assessment of DORA's potential value as a testbed. We address each major comment below with point-by-point responses and have revised the manuscript where feasible to enhance transparency and rigor.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction): The claim that the 515 tasks and their 3,500 gold trajectory steps 'faithfully sample the end-to-end disaster-response workflow' is load-bearing for all headline results on bottlenecks and fragility, yet the section supplies no inter-annotator agreement, coverage statistics across the 10 disaster types, or validation against after-action reports. Without these, it is impossible to assess selection bias toward well-structured, fully observable scenarios.

    Authors: We thank the referee for highlighting this important aspect of benchmark validity. The 515 tasks were authored iteratively by a team of domain experts with direct experience in emergency operations, drawing from the 45 real-world events, and each gold trajectory was verified for operational realism. In the revised manuscript, we have added explicit coverage statistics in §4 (new Table 2) detailing task and event distribution across all 10 disaster types, along with a description of the expert review process. We have also expanded the limitations discussion to address potential selection bias toward observable scenarios. However, because task development was collaborative and consensus-driven rather than independent parallel annotations, traditional inter-annotator agreement metrics were not collected; we have noted this methodological choice explicitly. Direct quantitative mapping to after-action reports was not performed due to variability in public report formats, but we have clarified how real-event grounding and replayable trajectories support workflow fidelity. These additions allow better assessment of representativeness while preserving the benchmark's contributions. revision: partial

  2. Referee: [§5] §5 (Evaluation and Results): The reported tool-selection and argument-grounding bottlenecks rest on accuracy metrics whose exact definitions (e.g., partial credit for argument grounding, handling of multi-step trajectory replay) are not fully specified; this directly affects interpretation of the 1.08-4.40% hint gains and the 7%-to-56% gap widening.

    Authors: We appreciate the referee's call for greater precision in metric definitions, which we agree is essential for interpreting the reported bottlenecks. In the revised §5, we have added a dedicated subsection with formal definitions: tool selection accuracy is the exact-match rate to the gold tool at each step; argument grounding uses per-argument partial credit (1.0 for exact match, 0.5 for correct type with value mismatch, 0 otherwise) averaged across arguments; multi-step trajectories are evaluated via sequential replay in the environment, requiring full sequence fidelity with only floating-point tolerance for geospatial parameters. These clarifications confirm that the small hint gains reflect fundamental selection and grounding difficulties rather than evaluation artifacts, and the length-dependent gap widening (7% to 56%) arises from compounding compositional errors. We have also added pseudocode and worked examples in a new appendix to ensure full reproducibility and transparent interpretation of all results. revision: yes
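For concreteness, a minimal sketch of those scoring rules as stated in the response: exact-match tool selection, per-argument partial credit, and full-sequence replay with a floating-point tolerance. Function names and the tolerance value are illustrative, not the paper's released evaluation code.

    import math

    def tool_correct(pred_tool, gold_tool):
        """Tool selection accuracy: exact match to the gold tool at each step."""
        return pred_tool == gold_tool

    def arg_score(pred_args, gold_args, tol=1e-6):
        """Per-argument partial credit, averaged over the gold arguments:
        1.0 exact match, 0.5 correct type with value mismatch, 0 otherwise.
        Floats match within a tolerance (geospatial parameters, per the text)."""
        if not gold_args:
            return 1.0
        total = 0.0
        for name, gold in gold_args.items():
            pred = pred_args.get(name)
            if isinstance(gold, float) and isinstance(pred, float):
                total += 1.0 if math.isclose(pred, gold, abs_tol=tol) else 0.5
            elif pred == gold:
                total += 1.0
            elif pred is not None and type(pred) is type(gold):
                total += 0.5
        return total / len(gold_args)

    def trajectory_solved(pred_steps, gold_steps):
        """Sequential replay: full sequence fidelity, tools and arguments alike."""
        return len(pred_steps) == len(gold_steps) and all(
            tool_correct(p["tool"], g["tool"]) and arg_score(p["args"], g["args"]) == 1.0
            for p, g in zip(pred_steps, gold_steps)
        )

    # Example: a value mismatch on one of two arguments scores (1.0 + 0.5) / 2.
    print(arg_score({"aoi": "a.geojson", "buffer_m": 250.0},
                    {"aoi": "a.geojson", "buffer_m": 500.0}))  # -> 0.75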

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and evaluation

Full rationale

The paper constructs an empirical benchmark (DORA) consisting of 515 expert-authored tasks and 3,500 gold trajectory steps across five operational dimensions, then evaluates 13 LLMs on tool-use accuracy, compositional fragility, and scaffold sensitivity. No mathematical derivations, fitted parameters, or self-referential equations appear in the abstract or described methodology; results are direct measurements against the fixed gold trajectories rather than quantities defined by the paper's own outputs. Self-citations, if present, are not load-bearing for any central claim, and the work contains no uniqueness theorems, ansatzes, or renamings of known results that reduce to prior author work by construction. The evaluation pipeline is externally falsifiable via the released tasks and trajectories.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a curated benchmark and evaluation rather than a derivation; no numerical parameters are fitted to data and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption: Expert-authored tasks and gold trajectories faithfully represent operational disaster-response requirements.
    Validity of the benchmark and all downstream claims rests on this premise, stated in the abstract.

pith-pipeline@v0.9.0 · 5653 in / 1341 out tokens · 43810 ms · 2026-05-13T01:20:42.312754+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 2 internal anchors

  1. [1]

    Implementing equitable wildfire response plans,

    J. Xu, D. J. Nair, and S. T. Waller, “Implementing equitable wildfire response plans,” Science, vol. 388, no. 6743, pp. 158–159, 2025

  2. [2]

    Effects of a natural disaster on mortality risks over the longer term,

    E. Frankenberg, C. Sumantri, and D. Thomas, “Effects of a natural disaster on mortality risks over the longer term,” Nature Sustainability, vol. 3, no. 8, pp. 614–619, 2020

  3. [3]

    Executable code actions elicit better llm agents,

    X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Executable code actions elicit better llm agents,” in Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Swe-agent: Agent-computer interfaces enable automated software engineering,

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” Advances in Neural Information Processing Systems, vol. 37, pp. 50528–50652, 2024

  5. [5]

    Agentbench: Evaluating LLMs as agents,

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating LLMs as agents,” in The Twelfth International Conference on Learning Representations, 2024

  6. [6]

    Earth-agent: Unlocking the full landscape of earth observation with agents,

    P. Feng, Z. Lv, J. Ye, X. Wang, X. Huo, J. Yu, W. Xu, W. Zhang, L. Bai, C. He, and W. Li, “Earth-agent: Unlocking the full landscape of earth observation with agents,” in The Fourteenth International Conference on Learning Representations, 2026

  7. [7]

    Openearthagent: A unified framework for tool-augmented geospatial agents,

    A. Shabbir, M. U. Sheikh, M. A. Munir, H. Debary, M. Fiaz, M. Z. Zaheer, P. Fraccaro, F. S. Khan, M. H. Khan, X. X. Zhu, et al., “Openearthagent: A unified framework for tool-augmented geospatial agents,” arXiv preprint arXiv:2602.17665, 2026

  8. [8]

    Openearth-agent: From tool calling to tool creation for open-environment earth observation,

    S. Zhao, F. Liu, X. Zhang, H. Chen, X. Gu, Z. Jiang, F. Ling, B. Fei, W. Zhang, J. Wang, et al., “Openearth-agent: From tool calling to tool creation for open-environment earth observation,” arXiv preprint arXiv:2603.22148, 2026

  9. [9]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  10. [10]

    Expel: Llm agents are experiential learners,

    A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, “Expel: Llm agents are experiential learners,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19632–19642, 2024

  11. [11]

    Autoguide: Automated generation and selection of context-aware guidelines for large language model agents,

    Y. Fu, D.-K. Kim, J. Kim, S. Sohn, L. Logeswaran, K. Bae, and H. Lee, “Autoguide: Automated generation and selection of context-aware guidelines for large language model agents,” Advances in Neural Information Processing Systems, vol. 37, pp. 119919–119948, 2024

  12. [12]

    Reasoningbank: Scaling agent self-evolving with reasoning memory,

    S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al., “Reasoningbank: Scaling agent self-evolving with reasoning memory,” arXiv preprint arXiv:2509.25140, 2025

  13. [13]

    Webarena: A realistic web environment for building autonomous agents,

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “Webarena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, 2024

  14. [14]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al., “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” Advances in Neural Information Processing Systems, vol. 37, pp. 52040–52094, 2024

  15. [15]

    SWE-bench multimodal: Do AI systems generalize to visual software domains?,

    J. Yang, C. E. Jimenez, A. L. Zhang, K. Lieret, J. Yang, X. Wu, O. Press, N. Muennighoff, G. Synnaeve, K. R. Narasimhan, D. Yang, S. Wang, and O. Press, “SWE-bench multimodal: Do AI systems generalize to visual software domains?,” in The Thirteenth International Conference on Learning Representations, 2025

  16. [16]

    Octobench: Benchmarking scaffold-aware instruction following in repository-grounded agentic coding,

    D. Ding, S. Liu, E. Yang, J. Lin, Z. Chen, S. Dou, H. Guo, W. Cheng, P. Zhao, C. Xiao, et al., “Octobench: Benchmarking scaffold-aware instruction following in repository-grounded agentic coding,” arXiv preprint arXiv:2601.10343, 2026

  17. [17]

    Featurebench: Benchmarking agentic coding for complex feature development,

    Q. Zhou, J. Zhang, H. Wang, R. Hao, J. Wang, M. Han, Y. Yang, S. Wu, F. Pan, L. Fan, D. Tu, and Z. Zhang, “Featurebench: Benchmarking agentic coding for complex feature development,” in The Fourteenth International Conference on Learning Representations, 2026

  18. [18]

    GAIA: a benchmark for general AI assistants,

    G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: a benchmark for general AI assistants,” in The Twelfth International Conference on Learning Representations, 2024

  19. [19]

    Gta: a benchmark for general tool agents,

    J. Wang, Z. Ma, Y. Li, S. Zhang, C. Chen, K. Chen, and X. Le, “Gta: a benchmark for general tool agents,” Advances in Neural Information Processing Systems, vol. 37, pp. 75749–75790, 2024

  20. [20]

    m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks,

    Z. Ma, W. Huang, J. Zhang, T. Gupta, and R. Krishna, “m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks,” in European Conference on Computer Vision, pp. 18–34, Springer, 2024

  21. [21]

    Geollm-engine: A realistic environment for building geospatial copilots,

    S. Singh, M. Fore, and D. Stamoulis, “Geollm-engine: A realistic environment for building geospatial copilots,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 585–594, 2024

  22. [22]

    ThinkGeo: Evaluating tool-augmented agents for remote sensing tasks,

    A. Shabbir, M. A. Munir, A. Dudhane, M. U. Sheikh, M. H. Khan, P. Fraccaro, J. B. Moreno, F. S. Khan, and S. Khan, “ThinkGeo: Evaluating tool-augmented agents for remote sensing tasks,” arXiv preprint arXiv:2505.23752, 2025

  23. [23]

    Towards llm agents for earth observation,

    C. H. Kao, W. Zhao, S. Revankar, S. Speas, S. Bhagat, R. Datta, C. P. Phoo, U. Mall, C. Vondrick, K. Bala, et al., “Towards llm agents for earth observation,” arXiv preprint arXiv:2504.12110, 2025

  24. [24]

    RS-Agent: Automating remote sensing tasks through intelligent agent,

    W. Xu, Z. Yu, B. Mu, Z. Wei, Y. Zhang, G. Li, J. Wang, and M. Peng, “RS-Agent: Automating remote sensing tasks through intelligent agent,” arXiv preprint arXiv:2406.07089, 2024

  25. [25]

    Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,

    C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi, “Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

  26. [26]

    xbd: A dataset for assessing building damage from satellite imagery,

    R. Gupta, R. Hosfelt, S. Sajeev, N. Patel, B. Goodman, J. Doshi, E. Heim, H. Choset, and M. Gaston, “xbd: A dataset for assessing building damage from satellite imagery,” arXiv preprint arXiv:1911.09296, 2019

  27. [27]

    Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response,

    J. Wang, W. Xuan, H. Qi, Z. Liu, K. Liu, Y. Wu, H. Chen, J. Song, J. Xia, Z. Zheng, and N. Yokoya, “Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response,” in Proceedings of the Neural Information Processing Systems, 2025

  28. [28]

    Bright: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response,

    H. Chen, J. Song, O. Dietrich, C. Broni-Bediako, W. Xuan, J. Wang, X. Shao, Y. Wei, J. Xia, C. Lan, et al., “Bright: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response,” Earth System Science Data, vol. 17, no. 11, pp. 6217–6253, 2025

  29. [29]

    National agriculture imagery program (NAIP)

    USDA Farm Service Agency, “National agriculture imagery program (NAIP).” https://naip-usdaonline.hub.arcgis.com/, 2022. Accessed: 2025

  30. [30]

    Maxar open data program

    Maxar Technologies, “Maxar open data program.” https://www.maxar.com/open-data, 2024. Accessed: 2025

  31. [31]

    The outcome of the 2022 landslide4sense competition: Advanced landslide detection from multisource satellite imagery,

    O. Ghorbanzadeh, Y. Xu, H. Zhao, J. Wang, Y. Zhong, D. Zhao, Q. Zang, S. Wang, F. Zhang, Y. Shi, et al., “The outcome of the 2022 landslide4sense competition: Advanced landslide detection from multisource satellite imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 9927–9942, 2022

  32. [32]

    Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning,

    X. Zhang, W. Yu, M.-O. Pun, and W. Shi, “Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 197, pp. 1–17, 2023

  33. [33]

    Rescuenet: A high resolution uav semantic segmentation dataset for natural disaster damage assessment,

    M. Rahnemoonfar, T. Chowdhury, and R. Murphy, “Rescuenet: A high resolution uav semantic segmentation dataset for natural disaster damage assessment,” Scientific Data, vol. 10, no. 1, p. 913, 2023

  34. [34]

    Crasar-u-droids: A large scale benchmark dataset for building alignment and damage assessment in georectified suas imagery,

    T. Manzini, P. Perali, R. Karnik, and R. Murphy, “Crasar-u-droids: A large scale benchmark dataset for building alignment and damage assessment in georectified suas imagery,” arXiv preprint arXiv:2407.17673, 2024

  35. [35]

    Openearthmap: A benchmark dataset for global high-resolution land cover mapping,

    J. Xia, N. Yokoya, B. Adriano, and C. Broni-Bediako, “Openearthmap: A benchmark dataset for global high-resolution land cover mapping,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6254–6264, 2023

  36. [36]

    Land surface temperature and climate data

    Japan Meteorological Agency, “Land surface temperature and climate data.” https://www.jma.go.jp/jma/indexe.html, 2025. Accessed: 2025

  37. [37]

    Planet dump retrieved from https://planet.osm.org

    OpenStreetMap contributors, “Planet dump retrieved from https://planet.osm.org.” https://www.openstreetmap.org, 2025. Data licensed under ODbL

  38. [38]

    Our world in data

    M. Roser, H. Ritchie, E. Ortiz-Ospina, L. Rodés-Guirao, J. Hasell, B. Macdonald, D. Beltekian, E. Mathieu, and C. Giattino, “Our world in data.” https://ourworldindata.org, 2025. Licensed under CC BY. Accessed: 2025

  39. [39]

    UNOSAT – United Nations Satellite Centre emergency mapping service

    United Nations Institute for Training and Research (UNITAR), “UNOSAT – United Nations Satellite Centre emergency mapping service.” https://unosat.org/services/, 2024. Accessed: 2026-04-06

  40. [40]

    Manual for CEMS-rapid mapping products,

    I. Joubert-Boitat, A. Wania, and S. Dalmasso, “Manual for CEMS-rapid mapping products,” Tech. Rep. JRC121741, European Commission, Joint Research Centre (JRC), 2020

  41. [41]

    National urban search and rescue (US&R) response system: Rescue field operations guide,

    Federal Emergency Management Agency (FEMA), “National urban search and rescue (US&R) response system: Rescue field operations guide,” Tech. Rep. US&R-23-FG, U.S. Department of Homeland Security, 2008

  42. [42]

    This is OCHA

    United Nations Office for the Coordination of Humanitarian Affairs (OCHA), “This is OCHA.” https://www.unocha.org/ocha, 2024. Established by UN General Assembly Resolution 46/182 (1991)

  43. [43]

    Hazus inventory technical manual,

    Federal Emergency Management Agency, “Hazus inventory technical manual,” Tech. Rep. Hazus 6.1, Department of Homeland Security, FEMA, Washington, D.C., 2024

  44. [44]

    The Multi-Temporal Urban Development SpaceNet Dataset,

    A. Van Etten, D. Hogan, J. M. Manso, J. Shermeyer, N. Weir, and R. Lewis, “The Multi-Temporal Urban Development SpaceNet Dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6398–6407, 2021

  45. [45]

    Toolllm: Facilitating large language models to master 16000+ real-world apis,

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” in The Twelfth International Conference on Learning Representations, 2024

  46. [46]

    GPT-5.4 thinking system card,

    OpenAI, “GPT-5.4 thinking system card,” tech. rep., OpenAI, March 2026

  47. [47]

    Claude sonnet 4.6 system card,

    Anthropic, “Claude sonnet 4.6 system card,” tech. rep., Anthropic, February 2026

  48. [48]

    Introducing Gemini 3 flash: Benchmarks, global availability

    Google DeepMind, “Introducing Gemini 3 flash: Benchmarks, global availability.” https://blog.google/products/gemini/gemini-3-flash/, 2026

  49. [49]

    Grok 4.1 model card,

    xAI, “Grok 4.1 model card,” tech. rep., xAI, November 2025

  50. [50]

    Qwen3.5: Towards native multimodal agents

    Qwen Team, “Qwen3.5: Towards native multimodal agents.” https://qwen.ai/blog?id=qwen3.5, 2026

  51. [51]

    MiMo-V2-Pro

    LLM-Core, Xiaomi, “MiMo-V2-Pro.” https://mimo.xiaomi.com/mimo-v2-pro, 2026. API model card, released March 18, 2026

  52. [52]

    Step 3.5 Flash: Fast enough to think, reliable enough to act

    StepFun, “Step 3.5 Flash: Fast enough to think, reliable enough to act.” https://static.stepfun.com/blog/step-3.5-flash/, 2025. Technical blog and model card

  53. [53]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI, “DeepSeek-V3.2: Pushing the frontier of open large language models,” arXiv preprint arXiv:2512.02556, 2025

  54. [54]

    Gemma 4: Byte for byte, the most capable open models,

    Gemma Team and Google DeepMind, “Gemma 4: Byte for byte, the most capable open models,” tech. rep., Google DeepMind, April 2026

  55. [55]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025

  56. [56]

    MiniMax-M2.7: A self-evolving agent model,

    MiniMax AI, “MiniMax-M2.7: A self-evolving agent model,” tech. rep., MiniMax AI, March 2026

  57. [57]

    Qwen3-VL technical report,

    S. Bai, Y. Cai, et al., “Qwen3-VL technical report,” 2025

  58. [58]

    Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant,

    G. He, G. Demartini, and U. Gadiraju, “Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant,” in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–22, 2025

  59. [59]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023

  60. [60]

    Rewoo: Decoupling reasoning from observations for efficient augmented language models,

    B. Xu, Z. Peng, B. Lei, S. Mukherjee, Y. Liu, and D. Xu, “Rewoo: Decoupling reasoning from observations for efficient augmented language models,” arXiv preprint arXiv:2305.18323, 2023