pith. machine review for the scientific record.

arxiv: 2605.11633 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

Heli Qi, Hongruixuan Chen, Junjue Wang, Junshi Xia, Kunyi Liu, Naoto Yokoya, Pengyu Dai, Stefano Ermon, Weihao Xuan, Zhuo Zheng

Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · disaster response · geospatial reasoning · agent benchmark · tool use · emergency operations · multi-modal data · spatial planning

The pith

LLM agents for disaster response are doubly bottlenecked by tool selection and argument grounding, and their accuracy drops sharply on longer compositional pipelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the DORA benchmark, which supplies 515 expert-written tasks drawn from 45 real disasters together with expert-verified gold trajectories totaling 3,500 tool-call steps that exercise a 108-tool library over optical, SAR, elevation, and vector data. Evaluation of 13 frontier models shows that agents repeatedly mis-select tools or ground their arguments incorrectly, and that supplying the correct tool order improves accuracy by only 1.08-4.40%. Accuracy also declines sharply as the number of required tool calls grows, exposing a basic inability to maintain coherent multi-step geospatial reasoning through an entire operational workflow. These results matter because responders need agents that can fuse multi-sensor inputs, plan evacuations, and generate reports without constant human intervention.

Core claim

The DORA benchmark demonstrates that current LLM agents cannot reliably execute end-to-end disaster-response pipelines: they exhibit domain-specific grounding errors in damage semantics and sensor modalities, remain bottlenecked by both tool selection and argument generation even when given gold tool sequences, and suffer widening gaps relative to expert trajectories as pipeline length increases from short to long sequences.

What carries the argument

The DORA benchmark, a collection of 515 tasks with expert-verified gold trajectories totaling 3,500 tool-call steps, which require agents to compose calls from a heterogeneous 108-tool geospatial library spanning multi-temporal imagery and vector layers.
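To make the artifact concrete, here is a minimal sketch of what one DORA-style task and its replayable gold trajectory could look like. The schema, field names, and example values are illustrative assumptions, not the paper's released format.

    from dataclasses import dataclass, field

    @dataclass
    class ToolCall:
        """One replayable step in a gold trajectory (hypothetical schema)."""
        tool: str   # a name drawn from the 108-tool library
        args: dict  # grounded arguments for that tool

    @dataclass
    class Task:
        """One of the 515 expert-authored tasks (hypothetical schema)."""
        task_id: str
        event: str        # one of the 45 real disaster events
        dimension: str    # one of the five operational dimensions
        query: str        # the expert-written natural-language task
        gold: list = field(default_factory=list)  # expert-verified ToolCalls

    # Illustrative example only, not taken from the benchmark release.
    task = Task(
        task_id="dora-0001",
        event="example-earthquake-2024",
        dimension="disaster perception",
        query="Map collapsed buildings in the AOI from pre/post SAR imagery.",
        gold=[
            ToolCall("load_sar_image", {"aoi": "aoi.geojson", "phase": "pre"}),
            ToolCall("load_sar_image", {"aoi": "aoi.geojson", "phase": "post"}),
            ToolCall("sar_change_detection", {"pre": "img_pre", "post": "img_post"}),
        ],
    )

An agent is then scored on whether its own tool calls reproduce the gold trajectory step by step.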

Load-bearing premise

That the 515 expert-authored tasks and their 3,500 gold trajectory steps accurately capture the full operational disaster-response pipeline without selection bias or oversimplification of real-world constraints.

What would settle it

Running the same 13 models on a fresh collection of live disaster events using the identical 108-tool library and measuring whether the observed tool-selection and length-dependent gaps remain or shrink.
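A sketch of how that check could be scored, assuming per-task records of gold-trajectory length and agent success. The bucket edges, tolerance, and toy numbers below are assumptions for illustration only.

    from collections import defaultdict

    def gap_by_length(results):
        """results: iterable of (gold_len, agent_solved) pairs.
        Gold trajectories are correct by construction, so the
        agent-to-gold gap per bucket is simply 1 - agent accuracy."""
        buckets = defaultdict(list)
        for gold_len, solved in results:
            # Bucket edges are illustrative; the paper's short/long split may differ.
            key = "short" if gold_len <= 3 else "medium" if gold_len <= 7 else "long"
            buckets[key].append(solved)
        return {k: 1.0 - sum(v) / len(v) for k, v in buckets.items()}

    # Toy numbers only, not the paper's data.
    original = gap_by_length([(2, True), (3, True), (2, False), (8, True), (9, False), (12, False)])
    fresh    = gap_by_length([(2, True), (3, False), (9, False), (11, False), (10, False), (2, True)])

    # The question the replication settles: does the long-pipeline gap
    # (7% -> 56% in the paper) persist on held-out live events?
    persists = fresh["long"] >= original["long"] - 0.05  # 0.05 tolerance is an assumption
    print(original, fresh, persists)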

Figures

Figures reproduced from arXiv: 2605.11633 by Heli Qi, Hongruixuan Chen, Junjue Wang, Junshi Xia, Kunyi Liu, Naoto Yokoya, Pengyu Dai, Stefano Ermon, Weihao Xuan, Zhuo Zheng.

Figure 1. Representative task examples in DORA across different disaster operational categories. Each example illustrates the …
Figure 2. DORA aggregates multi-modal data from 10 open-source databases into 45 disaster events distributed across five …
Figure 4. Distribution of 515 tasks across five …
Figure 5. Complexity and tool-usage profiles across DORA’s five dimensions. (a) Trajectory length grows from …
Figure 6. Our annotation pipeline: (1) experts author queries and symbolic trajectories; (2) a deterministic replay engine resolves …
Figure 7. Failure modes in disaster domain …
Figure 8. Agent performances (%) decomposed by input modality configuration.
Figure 9. Agents vs. gold trajectory: compositional …
Original abstract

Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DORA, the first agentic benchmark for end-to-end disaster response. It comprises 515 expert-authored tasks spanning 45 real-world events across 10 disaster types, paired with expert-verified replayable gold trajectories totaling 3,500 tool-call steps that invoke a 108-tool library over heterogeneous geospatial data (optical/SAR/multi-spectral imagery, elevation, and social vectors). The work evaluates 13 frontier LLMs across five operational dimensions (perception, spatial analysis, rescue planning, temporal reasoning, report synthesis) and reports three challenges: domain-specific grounding failures, double bottlenecks in tool selection and argument grounding (with gold-order hints yielding only 1.08-4.40% gains and scaffolds at most 3.24%), and compositional fragility that widens the agent-to-gold gap from 7% to 56% on longer trajectories.

Significance. If the tasks and trajectories are representative, DORA provides a valuable, large-scale testbed that integrates multi-modal perception with planning and reporting in high-stakes scenarios, going beyond prior isolated remote-sensing or generic tool-use evaluations. The explicit provision of replayable gold trajectories, real disaster events, and dimensionally structured tasks is a strength that enables reproducible diagnosis of failure modes such as sensor-modality mismatch and length-dependent composition errors. These empirical patterns could usefully inform agent design for operational reliability.

major comments (2)
  1. [§4] §4 (Benchmark Construction): The claim that the 515 tasks and their 3,500 gold trajectory steps 'faithfully sample the end-to-end disaster-response workflow' is load-bearing for all headline results on bottlenecks and fragility, yet the section supplies no inter-annotator agreement, coverage statistics across the 10 disaster types, or validation against after-action reports. Without these, it is impossible to assess selection bias toward well-structured, fully observable scenarios.
  2. [§5] §5 (Evaluation and Results): The reported tool-selection and argument-grounding bottlenecks rest on accuracy metrics whose exact definitions (e.g., partial credit for argument grounding, handling of multi-step trajectory replay) are not fully specified; this directly affects interpretation of the 1.08-4.40% hint gains and the 7%-to-56% gap widening.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2: axis labels and legend entries for the five task dimensions could be expanded with one-sentence operational definitions to improve readability for readers outside geospatial AI.
  2. [§2] §2 (Related Work): The comparison to prior benchmarks would benefit from a brief quantitative contrast (e.g., number of tools or trajectory length) rather than qualitative description alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and positive assessment of DORA's potential value as a testbed. We address each major comment below with point-by-point responses and have revised the manuscript where feasible to enhance transparency and rigor.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction): The claim that the 515 tasks and their 3,500 gold trajectory steps 'faithfully sample the end-to-end disaster-response workflow' is load-bearing for all headline results on bottlenecks and fragility, yet the section supplies no inter-annotator agreement, coverage statistics across the 10 disaster types, or validation against after-action reports. Without these, it is impossible to assess selection bias toward well-structured, fully observable scenarios.

    Authors: We thank the referee for highlighting this important aspect of benchmark validity. The 515 tasks were authored iteratively by a team of domain experts with direct experience in emergency operations, drawing from the 45 real-world events, and each gold trajectory was verified for operational realism. In the revised manuscript, we have added explicit coverage statistics in §4 (new Table 2) detailing task and event distribution across all 10 disaster types, along with a description of the expert review process. We have also expanded the limitations discussion to address potential selection bias toward observable scenarios. However, because task development was collaborative and consensus-driven rather than independent parallel annotations, traditional inter-annotator agreement metrics were not collected; we have noted this methodological choice explicitly. Direct quantitative mapping to after-action reports was not performed due to variability in public report formats, but we have clarified how real-event grounding and replayable trajectories support workflow fidelity. These additions allow better assessment of representativeness while preserving the benchmark's contributions. revision: partial

  2. Referee: [§5] §5 (Evaluation and Results): The reported tool-selection and argument-grounding bottlenecks rest on accuracy metrics whose exact definitions (e.g., partial credit for argument grounding, handling of multi-step trajectory replay) are not fully specified; this directly affects interpretation of the 1.08-4.40% hint gains and the 7%-to-56% gap widening.

    Authors: We appreciate the referee's call for greater precision in metric definitions, which we agree is essential for interpreting the reported bottlenecks. In the revised §5, we have added a dedicated subsection with formal definitions: tool selection accuracy is the exact-match rate to the gold tool at each step; argument grounding uses per-argument partial credit (1.0 for exact match, 0.5 for correct type with value mismatch, 0 otherwise) averaged across arguments; multi-step trajectories are evaluated via sequential replay in the environment, requiring full sequence fidelity with only floating-point tolerance for geospatial parameters. These clarifications confirm that the small hint gains reflect fundamental selection and grounding difficulties rather than evaluation artifacts, and the length-dependent gap widening (7% to 56%) arises from compounding compositional errors. We have also added pseudocode and worked examples in a new appendix to ensure full reproducibility and transparent interpretation of all results. revision: yes
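For concreteness, a minimal sketch of those scoring rules as stated in the response: exact-match tool selection, per-argument partial credit, and full-sequence replay with a floating-point tolerance. Function names and the tolerance value are illustrative, not the paper's released evaluation code.

    import math

    def tool_correct(pred_tool, gold_tool):
        """Tool selection accuracy: exact match to the gold tool at each step."""
        return pred_tool == gold_tool

    def arg_score(pred_args, gold_args, tol=1e-6):
        """Per-argument partial credit, averaged over the gold arguments:
        1.0 exact match, 0.5 correct type with value mismatch, 0 otherwise.
        Floats match within a tolerance (geospatial parameters, per the text)."""
        if not gold_args:
            return 1.0
        total = 0.0
        for name, gold in gold_args.items():
            pred = pred_args.get(name)
            if isinstance(gold, float) and isinstance(pred, float):
                total += 1.0 if math.isclose(pred, gold, abs_tol=tol) else 0.5
            elif pred == gold:
                total += 1.0
            elif pred is not None and type(pred) is type(gold):
                total += 0.5
        return total / len(gold_args)

    def trajectory_solved(pred_steps, gold_steps):
        """Sequential replay: full sequence fidelity, tools and arguments alike."""
        return len(pred_steps) == len(gold_steps) and all(
            tool_correct(p["tool"], g["tool"]) and arg_score(p["args"], g["args"]) == 1.0
            for p, g in zip(pred_steps, gold_steps)
        )

    # Example: a value mismatch on one of two arguments scores (1.0 + 0.5) / 2.
    print(arg_score({"aoi": "a.geojson", "buffer_m": 250.0},
                    {"aoi": "a.geojson", "buffer_m": 500.0}))  # -> 0.75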

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and evaluation

Full rationale

The paper constructs an empirical benchmark (DORA) consisting of 515 expert-authored tasks and 3,500 gold trajectory steps across five operational dimensions, then evaluates 13 LLMs on tool-use accuracy, compositional fragility, and scaffold sensitivity. No mathematical derivations, fitted parameters, or self-referential equations appear in the abstract or described methodology; results are direct measurements against the fixed gold trajectories rather than quantities defined by the paper's own outputs. Self-citations, if present, are not load-bearing for any central claim, and the work contains no uniqueness theorems, ansatzes, or renamings of known results that reduce to prior author work by construction. The evaluation pipeline is externally falsifiable via the released tasks and trajectories.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a curated benchmark and evaluation rather than a derivation; no numerical parameters are fitted to data and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption: Expert-authored tasks and gold trajectories faithfully represent operational disaster-response requirements.
    Validity of the benchmark and all downstream claims rests on this premise, stated in the abstract.

pith-pipeline@v0.9.0 · 5653 in / 1341 out tokens · 43810 ms · 2026-05-13T01:20:42.312754+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 2 internal anchors

  1. [1]

    Implementing equitable wildfire response plans,

    J. Xu, D. J. Nair, and S. T. Waller, “Implementing equitable wildfire response plans,” Science, vol. 388, no. 6743, pp. 158–159, 2025

  2. [2]

    Effects of a natural disaster on mortality risks over the longer term,

    E. Frankenberg, C. Sumantri, and D. Thomas, “Effects of a natural disaster on mortality risks over the longer term,” Nature Sustainability, vol. 3, no. 8, pp. 614–619, 2020

  3. [3]

    Executable code actions elicit better llm agents,

    X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Executable code actions elicit better llm agents,” in Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Swe-agent: Agent-computer interfaces enable automated software engineering,

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” Advances in Neural Information Processing Systems, vol. 37, pp. 50528–50652, 2024

  5. [5]

    Agentbench: Evaluating LLMs as agents,

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating LLMs as agents,” in The Twelfth International Conference on Learning Representations, 2024

  6. [6]

    Earth-agent: Unlocking the full landscape of earth observation with agents,

    P. Feng, Z. Lv, J. Ye, X. Wang, X. Huo, J. Yu, W. Xu, W. Zhang, L. Bai, C. He, and W. Li, “Earth-agent: Unlocking the full landscape of earth observation with agents,” in The Fourteenth International Conference on Learning Representations, 2026

  7. [7]

    Openearthagent: A unified framework for tool-augmented geospatial agents,

    A. Shabbir, M. U. Sheikh, M. A. Munir, H. Debary, M. Fiaz, M. Z. Zaheer, P. Fraccaro, F. S. Khan, M. H. Khan, X. X. Zhu, et al., “Openearthagent: A unified framework for tool-augmented geospatial agents,” arXiv preprint arXiv:2602.17665, 2026

  8. [8]

    Openearth-agent: From tool calling to tool creation for open-environment earth observation,

    S. Zhao, F. Liu, X. Zhang, H. Chen, X. Gu, Z. Jiang, F. Ling, B. Fei, W. Zhang, J. Wang, et al., “Openearth-agent: From tool calling to tool creation for open-environment earth observation,” arXiv preprint arXiv:2603.22148, 2026

  9. [9]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  10. [10]

    Expel: Llm agents are experiential learners,

    A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, “Expel: Llm agents are experiential learners,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19632–19642, 2024

  11. [11]

    Autoguide: Automated generation and selection of context-aware guidelines for large language model agents,

    Y. Fu, D.-K. Kim, J. Kim, S. Sohn, L. Logeswaran, K. Bae, and H. Lee, “Autoguide: Automated generation and selection of context-aware guidelines for large language model agents,” Advances in Neural Information Processing Systems, vol. 37, pp. 119919–119948, 2024

  12. [12]

    Reasoningbank: Scaling agent self-evolving with reasoning memory,

    S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al., “Reasoningbank: Scaling agent self-evolving with reasoning memory,” arXiv preprint arXiv:2509.25140, 2025

  13. [13]

    Webarena: A realistic web environment for building autonomous agents,

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “Webarena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, 2024

  14. [14]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al., “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” Advances in Neural Information Processing Systems, vol. 37, pp. 52040–52094, 2024

  15. [15]

    SWE-bench multimodal: Do AI systems generalize to visual software domains?,

    J. Yang, C. E. Jimenez, A. L. Zhang, K. Lieret, J. Yang, X. Wu, O. Press, N. Muennighoff, G. Synnaeve, K. R. Narasimhan, D. Yang, S. Wang, and O. Press, “SWE-bench multimodal: Do AI systems generalize to visual software domains?,” in The Thirteenth International Conference on Learning Representations, 2025

  16. [16]

    Octobench: Benchmarking scaffold-aware instruction following in repository-grounded agentic coding,

    D. Ding, S. Liu, E. Yang, J. Lin, Z. Chen, S. Dou, H. Guo, W. Cheng, P. Zhao, C. Xiao, et al., “Octobench: Benchmarking scaffold-aware instruction following in repository-grounded agentic coding,” arXiv preprint arXiv:2601.10343, 2026

  17. [17]

    Featurebench: Benchmarking agentic coding for complex feature development,

    Q. Zhou, J. Zhang, H. Wang, R. Hao, J. Wang, M. Han, Y. Yang, S. Wu, F. Pan, L. Fan, D. Tu, and Z. Zhang, “Featurebench: Benchmarking agentic coding for complex feature development,” in The Fourteenth International Conference on Learning Representations, 2026

  18. [18]

    GAIA: a benchmark for general AI assistants,

    G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: a benchmark for general AI assistants,” in The Twelfth International Conference on Learning Representations, 2024

  19. [19]

    Gta: a benchmark for general tool agents,

    J. Wang, Z. Ma, Y. Li, S. Zhang, C. Chen, K. Chen, and X. Le, “Gta: a benchmark for general tool agents,” Advances in Neural Information Processing Systems, vol. 37, pp. 75749–75790, 2024

  20. [20]

    m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks,

    Z. Ma, W. Huang, J. Zhang, T. Gupta, and R. Krishna, “m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks,” in European Conference on Computer Vision, pp. 18–34, Springer, 2024

  21. [21]

    Geollm-engine: A realistic environment for building geospatial copilots,

    S. Singh, M. Fore, and D. Stamoulis, “Geollm-engine: A realistic environment for building geospatial copilots,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 585–594, 2024

  22. [22]

    ThinkGeo: Evaluating tool-augmented agents for remote sensing tasks,

    A. Shabbir, M. A. Munir, A. Dudhane, M. U. Sheikh, M. H. Khan, P. Fraccaro, J. B. Moreno, F. S. Khan, and S. Khan, “ThinkGeo: Evaluating tool-augmented agents for remote sensing tasks,” arXiv preprint arXiv:2505.23752, 2025

  23. [23]

    Towards llm agents for earth observation,

    C. H. Kao, W. Zhao, S. Revankar, S. Speas, S. Bhagat, R. Datta, C. P. Phoo, U. Mall, C. Vondrick, K. Bala, et al., “Towards llm agents for earth observation,” arXiv preprint arXiv:2504.12110, 2025

  24. [24]

    RS-Agent: Automating remote sensing tasks through intelligent agent,

    W. Xu, Z. Yu, B. Mu, Z. Wei, Y. Zhang, G. Li, J. Wang, and M. Peng, “RS-Agent: Automating remote sensing tasks through intelligent agent,” arXiv preprint arXiv:2406.07089, 2024

  25. [25]

    Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,

    C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi, “Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

  26. [26]

    xbd: A dataset for assessing building damage from satellite imagery,

    R. Gupta, R. Hosfelt, S. Sajeev, N. Patel, B. Goodman, J. Doshi, E. Heim, H. Choset, and M. Gaston, “xbd: A dataset for assessing building damage from satellite imagery,” arXiv preprint arXiv:1911.09296, 2019

  27. [27]

    Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response,

    J. Wang, W. Xuan, H. Qi, Z. Liu, K. Liu, Y. Wu, H. Chen, J. Song, J. Xia, Z. Zheng, and N. Yokoya, “Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response,” in Proceedings of the Neural Information Processing Systems, 2025

  28. [28]

    Bright: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response,

    H. Chen, J. Song, O. Dietrich, C. Broni-Bediako, W. Xuan, J. Wang, X. Shao, Y. Wei, J. Xia, C. Lan, et al., “Bright: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response,” Earth System Science Data, vol. 17, no. 11, pp. 6217–6253, 2025

  29. [29]

    National agriculture imagery program (NAIP)

    USDA Farm Service Agency, “National agriculture imagery program (NAIP).” https://naip-usdaonline.hub.arcgis.com/, 2022. Accessed: 2025

  30. [30]

    Maxar open data program

    Maxar Technologies, “Maxar open data program.” https://www.maxar.com/open-data, 2024. Accessed: 2025

  31. [31]

    The outcome of the 2022 landslide4sense competition: Advanced landslide detection from multisource satellite imagery,

    O. Ghorbanzadeh, Y. Xu, H. Zhao, J. Wang, Y. Zhong, D. Zhao, Q. Zang, S. Wang, F. Zhang, Y. Shi, et al., “The outcome of the 2022 landslide4sense competition: Advanced landslide detection from multisource satellite imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 9927–9942, 2022

  32. [32]

    Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning,

    X. Zhang, W. Yu, M.-O. Pun, and W. Shi, “Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 197, pp. 1–17, 2023

  33. [33]

    Rescuenet: A high resolution uav semantic segmentation dataset for natural disaster damage assessment,

    M. Rahnemoonfar, T. Chowdhury, and R. Murphy, “Rescuenet: A high resolution uav semantic segmentation dataset for natural disaster damage assessment,” Scientific Data, vol. 10, no. 1, p. 913, 2023

  34. [34]

    Crasar-u-droids: A large scale benchmark dataset for building alignment and damage assessment in georectified suas imagery,

    T. Manzini, P. Perali, R. Karnik, and R. Murphy, “Crasar-u-droids: A large scale benchmark dataset for building alignment and damage assessment in georectified suas imagery,” arXiv preprint arXiv:2407.17673, 2024

  35. [35]

    Openearthmap: A benchmark dataset for global high-resolution land cover mapping,

    J. Xia, N. Yokoya, B. Adriano, and C. Broni-Bediako, “Openearthmap: A benchmark dataset for global high-resolution land cover mapping,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6254–6264, 2023

  36. [36]

    Land surface temperature and climate data

    Japan Meteorological Agency, “Land surface temperature and climate data.” https://www.jma.go.jp/jma/indexe.html, 2025. Accessed: 2025

  37. [37]

    Planet dump retrieved from https://planet.osm.org

    OpenStreetMap contributors, “Planet dump retrieved from https://planet.osm.org.” https://www.openstreetmap.org, 2025. Data licensed under ODbL

  38. [38]

    Our world in data

    M. Roser, H. Ritchie, E. Ortiz-Ospina, L. Rodés-Guirao, J. Hasell, B. Macdonald, D. Beltekian, E. Mathieu, and C. Giattino, “Our world in data.” https://ourworldindata.org, 2025. Licensed under CC BY. Accessed: 2025

  39. [39]

    UNOSAT – United Nations Satellite Centre emergency mapping service

    United Nations Institute for Training and Research (UNITAR), “UNOSAT – United Nations Satellite Centre emergency mapping service.” https://unosat.org/services/, 2024. Accessed: 2026-04-06

  40. [40]

    Manual for CEMS-rapid mapping products,

    I. Joubert-Boitat, A. Wania, and S. Dalmasso, “Manual for CEMS-rapid mapping products,” Tech. Rep. JRC121741, European Commission, Joint Research Centre (JRC), 2020

  41. [41]

    National urban search and rescue (US&R) response system: Rescue field operations guide,

    Federal Emergency Management Agency (FEMA), “National urban search and rescue (US&R) response system: Rescue field operations guide,” Tech. Rep. US&R-23-FG, U.S. Department of Homeland Security, 2008

  42. [42]

    This is OCHA

    United Nations Office for the Coordination of Humanitarian Affairs (OCHA), “This is OCHA.” https://www.unocha.org/ocha, 2024. Established by UN General Assembly Resolution 46/182 (1991)

  43. [43]

    Hazus inventory technical manual,

    Federal Emergency Management Agency, “Hazus inventory technical manual,” Tech. Rep. Hazus 6.1, Department of Homeland Security, FEMA, Washington, D.C., 2024

  44. [44]

    The Multi-Temporal Urban Development SpaceNet Dataset,

    A. Van Etten, D. Hogan, J. M. Manso, J. Shermeyer, N. Weir, and R. Lewis, “The Multi-Temporal Urban Development SpaceNet Dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6398–6407, 2021

  45. [45]

    Toolllm: Facilitating large language models to master 16000+ real-world apis,

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” in The Twelfth International Conference on Learning Representations, 2024

  46. [46]

    GPT-5.4 thinking system card,

    OpenAI, “GPT-5.4 thinking system card,” tech. rep., OpenAI, March 2026

  47. [47]

    Claude sonnet 4.6 system card,

    Anthropic, “Claude sonnet 4.6 system card,” tech. rep., Anthropic, February 2026

  48. [48]

    Introducing Gemini 3 flash: Benchmarks, global availability

    Google DeepMind, “Introducing Gemini 3 flash: Benchmarks, global availability.” https://blog.google/products/gemini/gemini-3-flash/, 2026

  49. [49]

    Grok 4.1 model card,

    xAI, “Grok 4.1 model card,” tech. rep., xAI, November 2025

  50. [50]

    Qwen3.5: Towards native multimodal agents

    Qwen Team, “Qwen3.5: Towards native multimodal agents.” https://qwen.ai/blog?id=qwen3.5, 2026

  51. [51]

    MiMo-V2-Pro

    LLM-Core, Xiaomi, “MiMo-V2-Pro.” https://mimo.xiaomi.com/mimo-v2-pro, 2026. API model card, released March 18, 2026

  52. [52]

    Step 3.5 Flash: Fast enough to think, reliable enough to act

    StepFun, “Step 3.5 Flash: Fast enough to think, reliable enough to act.” https://static.stepfun.com/blog/step-3.5-flash/, 2025. Technical blog and model card

  53. [53]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI, “DeepSeek-V3.2: Pushing the frontier of open large language models,” arXiv preprint arXiv:2512.02556, 2025

  54. [54]

    Gemma 4: Byte for byte, the most capable open models,

    Gemma Team and Google DeepMind, “Gemma 4: Byte for byte, the most capable open models,” tech. rep., Google DeepMind, April 2026

  55. [55]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025

  56. [56]

    MiniMax-M2.7: A self-evolving agent model,

    MiniMax AI, “MiniMax-M2.7: A self-evolving agent model,” tech. rep., MiniMax AI, March 2026

  57. [57]

    Qwen3-VL technical report,

    S. Bai, Y. Cai, et al., “Qwen3-VL technical report,” 2025

  58. [58]

    Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant,

    G. He, G. Demartini, and U. Gadiraju, “Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant,” in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–22, 2025

  59. [59]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023

  60. [60]

    Rewoo: Decoupling reasoning from observations for efficient augmented language models,

    B. Xu, Z. Peng, B. Lei, S. Mukherjee, Y. Liu, and D. Xu, “Rewoo: Decoupling reasoning from observations for efficient augmented language models,” arXiv preprint arXiv:2305.18323, 2023