Pith · machine review for the scientific record

arXiv: 2604.24919 · v2 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:52 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: agentic AI · remote sensing · Earth Observation · geospatial workflows · multi-step reasoning · design principles · failure modes · position paper

The pith

Earth Observation workflows impose structural challenges on generic agentic AI, necessitating new design principles for geospatial agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic AI promises coordinated reasoning over tools and data for complex tasks, but Earth Observation involves georeferenced, multi-modal data in which each operation, such as reprojection or compositing, alters the underlying state and can propagate errors silently. The paper examines how common assumptions in agentic systems fail here, because correctness requires geospatial consistency and physical validity, not just logical coherence. It argues these are not minor issues but fundamental ones, calling for agents built around structured geospatial state tracking, tool-aware reasoning that accounts for transformations, verifier-guided execution, and validity-aware evaluation. If true, this means generic agent frameworks cannot simply be applied to remote sensing without major redesign. A sympathetic reader would care because reliable multi-step analysis could unlock better use of satellite data for monitoring and decision-making.
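
To make the silent-propagation point concrete, here is a minimal hypothetical sketch (not drawn from the paper; the metadata fields and values are invented for illustration): two rasters on different grids are differenced element-wise, the arithmetic succeeds, and nothing in the numeric output signals that the comparison was geospatially invalid.

```python
# Hypothetical illustration (not from the paper): a "change map" computed from two
# rasters whose grids do not actually align. The arithmetic succeeds and returns
# plausible numbers, so the inconsistency propagates silently to later steps.

pre_image = {
    "crs": "EPSG:4326",          # geographic lat/lon grid
    "pixel_size_deg": 0.001,
    "acquired": "2025-06-01",
    "values": [[0.12, 0.15], [0.18, 0.22]],   # e.g. a normalized water index
}
post_image = {
    "crs": "EPSG:32633",         # projected UTM grid -- a different grid entirely
    "pixel_size_m": 10.0,
    "acquired": "2025-06-20",
    "values": [[0.45, 0.48], [0.50, 0.52]],
}

def naive_change_map(pre, post):
    # A generic agent that only checks array shapes will happily difference these.
    return [
        [b - a for a, b in zip(row_pre, row_post)]
        for row_pre, row_post in zip(pre["values"], post["values"])
    ]

change = naive_change_map(pre_image, post_image)
print(change)  # plausible-looking numbers, but the pixels do not refer to the same ground locations
```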

Core claim

The paper claims that the challenges in applying agentic AI to Earth Observation are structural, arising from the georeferenced, temporally structured, and physically constrained nature of EO data and workflows. Operations such as resampling and aggregation transform the underlying state, making errors propagate across steps in ways that generic systems do not handle. Therefore, EO-native agents must be designed with structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning to ensure correctness.

What carries the argument

EO-native agent design principles centered on structured geospatial state, tool-aware reasoning that respects data transformations, verifier-guided execution for consistency checks, and validity-aware learning and evaluation.
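
As a rough illustration of what "structured geospatial state" and "tool-aware reasoning" might look like in code, a minimal sketch follows; the class, field names, and the reproject wrapper are assumptions for exposition, not the paper's specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeoState:
    """Explicit geospatial state carried alongside the data itself (illustrative)."""
    crs: str                  # coordinate reference system, e.g. "EPSG:4326"
    pixel_size: float         # ground sampling distance in CRS units
    time_range: tuple         # (start, end) ISO dates covered by the data
    lineage: tuple = ()       # ordered record of operations already applied

def reproject(state: GeoState, target_crs: str, target_pixel_size: float) -> GeoState:
    """Tool-aware wrapper: the operation returns a new state describing how the
    transformation changed CRS and resolution, instead of mutating data opaquely."""
    return GeoState(
        crs=target_crs,
        pixel_size=target_pixel_size,
        time_range=state.time_range,
        lineage=state.lineage + (f"reproject->{target_crs}@{target_pixel_size}",),
    )

s0 = GeoState(crs="EPSG:4326", pixel_size=0.001, time_range=("2025-06-01", "2025-06-01"))
s1 = reproject(s0, "EPSG:32633", 10.0)
print(s1.lineage)   # the agent can now reason over what each step did to the state
```

Keeping the state immutable and appending to a lineage record is one simple way to make each transformation visible to later reasoning and verification steps.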

If this is right

  • Multi-step EO pipelines require explicit tracking of how operations transform geospatial properties to avoid undetected inconsistencies.
  • Verification must extend beyond logical coherence to include physical validity and temporal consistency across workflow steps (a minimal sketch of such checks follows this list).
  • Agent evaluation in EO settings needs metrics that capture geospatial accuracy and error propagation in addition to task completion.
  • New agent architectures tailored to physical and geospatial constraints will be essential rather than adaptations of general frameworks.
  • Reliable long-horizon reasoning becomes possible for applications such as data compositing and change detection once these principles are adopted.
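
A minimal sketch of the kind of per-step verification the second bullet points to, under assumed check names and thresholds (none of which come from the paper): a tool call is accepted only if grid consistency, temporal ordering, and a physical value range all hold.

```python
# Verifier-guided execution (illustrative): after each tool call, checks run against
# the tracked geospatial state rather than against the agent's text trace alone.

def check_same_grid(a, b):
    ok = a["crs"] == b["crs"] and a["pixel_size"] == b["pixel_size"]
    return ok, "" if ok else "grid mismatch: reproject/resample before comparing"

def check_temporal_order(pre, post):
    ok = pre["time"] <= post["time"]
    return ok, "" if ok else "pre-image is later than post-image"

def check_physical_range(values, lo, hi):
    flat = [v for row in values for v in row]
    ok = all(lo <= v <= hi for v in flat)     # e.g. a normalized index difference
    return ok, "" if ok else "values outside physically valid range"

def verify_step(step_name, checks):
    failures = [msg for ok, msg in checks if not ok]
    if failures:
        raise ValueError(f"{step_name} rejected by verifier: {failures}")

pre  = {"crs": "EPSG:32633", "pixel_size": 10.0, "time": "2025-06-01"}
post = {"crs": "EPSG:32633", "pixel_size": 10.0, "time": "2025-06-20"}
ndwi_diff = [[0.1, -0.2], [0.05, 0.0]]

verify_step("change_detection", [
    check_same_grid(pre, post),
    check_temporal_order(pre, post),
    check_physical_range(ndwi_diff, lo=-2.0, hi=2.0),
])
print("step accepted")
```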

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structural mismatch between generic agents and domain-specific state transformations may appear in other fields that involve coordinate systems or physical simulations.
  • Embedding domain verifiers as core components could become standard practice for agentic systems in scientific data analysis.
  • Empirical tests on public EO benchmark datasets could quantify how much the proposed design principles reduce silent failure rates compared with unmodified agents.

Load-bearing premise

That the identified failure modes and constraints in EO workflows cannot be adequately addressed through incremental extensions of existing generic agentic AI frameworks and instead require fundamentally new design principles.

What would settle it

A demonstration that a generic agentic system can complete a complex multi-step EO workflow such as time-series change detection involving reprojection, resampling, and aggregation while preserving physical validity and geospatial consistency without custom EO-specific modules.
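
One hypothetical way to operationalize such a settling test is a validity-aware scoring harness, sketched below with invented checks and weights; the point is that task completion alone should not earn a full score if geospatial, temporal, or physical validity is violated.

```python
# Validity-aware scoring of a finished workflow (illustrative only). A generic agent
# that "completes" the task but violates geospatial or physical constraints should not
# score well under this kind of evaluation.

def evaluate_workflow(result, expected_crs, time_window, valid_range):
    checks = {
        "task_completed": result.get("change_map") is not None,
        "crs_consistent": result.get("crs") == expected_crs,
        "temporally_valid": time_window[0] <= result.get("time", "") <= time_window[1],
        "physically_valid": all(
            valid_range[0] <= v <= valid_range[1]
            for row in result.get("change_map", []) for v in row
        ),
    }
    # Completion alone is not enough: validity failures dominate the score.
    score = 0.25 * checks["task_completed"] + 0.75 * all(
        [checks["crs_consistent"], checks["temporally_valid"], checks["physically_valid"]]
    )
    return score, checks

result = {
    "change_map": [[0.1, -0.3], [0.0, 0.2]],
    "crs": "EPSG:32633",
    "time": "2025-06-20",
}
score, checks = evaluate_workflow(result, "EPSG:32633", ("2025-06-01", "2025-06-30"), (-2.0, 2.0))
print(score, checks)
```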

Figures

Figures reproduced from arXiv: 2604.24919 by Akashah Shabbir, Begum Demir, Fahad Khan, Muhammad Akhtar Munir, Muhammad Haris Khan, Muhammad Umer Sheikh, Salman Khan, Xiao Xiang Zhu.

Figure 1: Evolution of AI paradigms in remote sensing toward agentic models. The field has progressed from task-specific predictive …
Figure 2: Illustrative comparison of generic and EO-native agent traces for flood-area estimation from pre/post imagery. The generic …
Figure 3: Implicit assumptions underlying generic agentic AI models and their mismatch with EO workflows. The figure summarizes …
Figure 4: Structural properties of Earth observation environments for agentic models. EO reasoning operates within a layered environment …
Figure 5: Agentic EO workflow and representative failure modes. The figure illustrates how errors introduced during data selection, …
Figure 6: Agentic EO architecture and tool-integrated reasoning.
Figure 7: Design blueprint for agentic Earth observation. This consists of a Planner, Executor, and Verifier operating over a shared …
Original abstract

Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have advanced representation learning and language-grounded interaction in remote sensing, and agentic AI has shown strong potential for long-horizon reasoning and tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate on georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation transform the underlying state and can constrain later analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We examine the assumptions commonly made in generic agentic systems, analyze how they break in geospatial workflows, and characterize failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation. Building reliable geospatial agents, therefore, requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper claims that Earth Observation (EO) workflows introduce structural challenges for generic agentic AI systems because operations like reprojection, resampling, compositing, and aggregation transform geospatial state and can cause silent error propagation, requiring not only internal coherence but also geospatial and physical validity. It examines breakdowns of standard agent assumptions in multi-step EO pipelines and outlines four design principles for EO-native agents: structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation.

Significance. If the structural nature of the challenges and the necessity of the proposed principles hold, the paper could meaningfully guide research at the intersection of agentic AI and remote sensing by highlighting domain-specific constraints that generic frameworks may not address through simple extensions. As a position paper it contributes by framing failure modes and research directions rather than presenting new empirical results.

major comments (2)
  1. [§3] §3 (failure modes in multi-step EO pipelines): the central assertion that the identified challenges are structural and cannot be adequately addressed by incremental extensions to generic agentic frameworks (e.g., adding georeferenced state graphs or precondition checkers) is not supported by a concrete counter-example or case where such an augmentation still produces unrecoverable geospatial inconsistency or silent error propagation.
  2. [§4] §4 (design principles): the four proposed principles are described at a high conceptual level without formal definitions, pseudocode, or a worked example showing how 'structured geospatial state' or 'verifier-guided execution' would be realized in an agent architecture and would demonstrably mitigate the failure modes from §3.
minor comments (2)
  1. [Abstract and §4] The abstract and §4 refer to 'validity-aware learning and evaluation' but the text provides no detail on the learning mechanism, loss functions, or evaluation protocol that would implement this principle.
  2. [§2] A small number of citations to recent agentic AI surveys or EO workflow papers could be added to strengthen the grounding of the assumptions examined in §2.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. As a position paper, our goal is to frame structural challenges and research directions rather than provide empirical benchmarks. We address the major comments below and will revise the manuscript to incorporate concrete illustrations and more formal elements where feasible.

Point-by-point responses
  1. Referee: [§3] §3 (failure modes in multi-step EO pipelines): the central assertion that the identified challenges are structural and cannot be adequately addressed by incremental extensions to generic agentic frameworks (e.g., adding georeferenced state graphs or precondition checkers) is not supported by a concrete counter-example or case where such an augmentation still produces unrecoverable geospatial inconsistency or silent error propagation.

    Authors: We agree that a concrete counter-example would make the structural claim more compelling. Section 3 analyzes how operations such as reprojection, resampling, and aggregation transform geospatial state and enable silent error propagation, but does not include an end-to-end case demonstrating failure of incremental extensions. In the revision we will add a worked illustrative pipeline (e.g., temporal compositing followed by change detection) showing that simply augmenting an agent with georeferenced state graphs and precondition checkers still permits unrecoverable inconsistency when physical validity constraints are not explicitly enforced. revision: yes

  2. Referee: [§4] §4 (design principles): the four proposed principles are described at a high conceptual level without formal definitions, pseudocode, or a worked example showing how 'structured geospatial state' or 'verifier-guided execution' would be realized in an agent architecture and would demonstrably mitigate the failure modes from §3.

    Authors: The principles are presented at a conceptual level because the paper is a position piece outlining research directions rather than an architectural specification. We acknowledge that formal definitions, pseudocode, and a mitigation example would improve clarity. In the revision we will introduce concise formal definitions for each principle, provide pseudocode for the structured geospatial state representation and verifier-guided execution loop, and include a worked example that directly maps back to the failure modes in §3 to illustrate mitigation. revision: yes
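
To illustrate what such pseudocode might look like (a hedged sketch, not the authors' forthcoming revision), here is a toy Planner/Executor/Verifier loop over shared state in which a step passes its tool precondition yet is still rejected on physical-validity grounds, connecting the two responses above; every name, tool, and check is invented for illustration.

```python
# Illustrative sketch only: a Planner/Executor/Verifier loop over shared geospatial
# state, where a precondition check (grid compatibility) passes but a domain verifier
# (physical validity) rejects the step before the state is committed.

def composite_median(state, args):
    # Toy "temporal compositing": keep metadata, tag the product.
    new = dict(state)
    new["product"] = "median_composite"
    new["values"] = [[0.2, 0.3], [1.7, 0.4]]   # one value drifts outside the valid range
    return new

TOOLS = {
    "composite_median": {
        "run": composite_median,
        # Precondition a minimally augmented generic agent might add: grids already match.
        "precondition": lambda state, args: state.get("crs") == args.get("target_crs"),
    },
}

def physical_validity(old, new, step):
    vals = [v for row in new.get("values", []) for v in row]
    ok = all(-1.0 <= v <= 1.0 for v in vals)   # e.g. a normalized index
    return ok, "" if ok else "composite contains physically impossible values"

def run_agent(plan, tools, verifiers, state):
    for step in plan:                          # Planner output: an ordered plan of tool calls
        tool = tools[step["tool"]]
        if not tool["precondition"](state, step["args"]):
            raise RuntimeError(f"precondition failed for {step['tool']}")
        candidate = tool["run"](state, step["args"])   # Executor runs the tool
        for verify in verifiers:               # Verifier-guided execution
            ok, reason = verify(state, candidate, step)
            if not ok:
                raise RuntimeError(f"verifier rejected {step['tool']}: {reason}")
        state = candidate                      # commit only after verification
    return state

state0 = {"crs": "EPSG:32633", "pixel_size": 10.0}
plan = [{"tool": "composite_median", "args": {"target_crs": "EPSG:32633"}}]

try:
    run_agent(plan, TOOLS, [physical_validity], state0)
except RuntimeError as err:
    print(err)   # precondition passed, but the verifier catches the invalid composite
```

The separation mirrors the Planner, Executor, and Verifier blueprint in Figure 7, but the concrete interfaces here are placeholders.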

Circularity Check

0 steps flagged

Position paper identifies EO-specific agentic challenges without circular derivation

Full rationale

The paper is a conceptual position piece that examines standard assumptions in generic agentic systems (stateless tool calls, internal coherence only) and describes how they break under EO operations such as reprojection and temporal compositing. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The claim that the challenges are structural and require new design principles is advanced by direct analysis of workflow constraints rather than by reducing to a prior self-citation or a definitional loop. With no load-bearing self-referential step, the argument rests on external knowledge of agent limitations and geospatial data properties rather than on its own prior claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central argument rests on domain assumptions about how geospatial operations affect data state and error propagation; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: EO workflows operate on georeferenced, multi-modal, and temporally structured data where operations such as reprojection and compositing transform the underlying state and constrain later analysis
    Invoked directly in the abstract as the basis for claiming challenges are structural.
  • domain assumption: Errors may propagate silently across steps in multi-step EO pipelines, with correctness depending on geospatial consistency and physical validity
    Core premise used to differentiate EO from generic agentic settings.

pith-pipeline@v0.9.0 · 5545 in / 1303 out tokens · 46631 ms · 2026-05-14T20:52:20.286386+00:00 · methodology

