pith. machine review for the scientific record.

arxiv: 2604.08896 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: no theorem link

GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords geoscience · remote sensing · multimodal benchmark · multi-agent framework · large language models · tool-augmented agents · GeoMMBench · GeoMMAgent

The pith

A multi-agent framework with domain-specific remote sensing tools enables large language models to outperform standalone versions on complex geoscience tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoMMBench as a multimodal question-answering benchmark that spans diverse geoscience disciplines, sensor types, and tasks to test large language models more thoroughly than earlier efforts. Evaluation of 36 open-source and proprietary models reveals consistent gaps in domain knowledge, perceptual grounding, and reasoning needed for expert geospatial work. To close those gaps the authors build GeoMMAgent, a multi-agent system that routes queries through specialized retrieval, perception, and reasoning components backed by remote-sensing models and tools. Experiments on the benchmark show the agent framework delivers markedly higher accuracy than any single model operating alone. This result points to tool augmentation as a practical route toward reliable performance on the wide-ranging, heterogeneous problems typical of geoscience and remote sensing.

Core claim

GeoMMBench exposes systematic deficiencies in current multimodal large language models when faced with the breadth of disciplinary knowledge, sensor modalities, and task variety in geoscience and remote sensing. GeoMMAgent counters these deficiencies by orchestrating multiple agents that integrate retrieval of domain knowledge, perception via specialized remote-sensing models, and step-by-step reasoning, thereby achieving significantly higher performance than any standalone large language model on the same benchmark.

What carries the argument

GeoMMAgent, a multi-agent framework that routes tasks across retrieval, perception, and reasoning agents equipped with domain-specific remote sensing models and tools.
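
To make the routing pattern concrete, here is a minimal Python sketch of a query passing through the three stages. Everything in it is illustrative: Query, retrieval_agent, perception_agent, reasoning_agent, and geomm_agent are hypothetical names with placeholder bodies, not the paper's actual interfaces.

    # Minimal, hypothetical sketch of the retrieval/perception/reasoning routing
    # described above. Names are illustrative; the paper's interfaces may differ.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Query:
        text: str
        image_path: Optional[str] = None

    def retrieval_agent(q: Query) -> str:
        # Placeholder: fetch domain knowledge (e.g., sensor band definitions).
        return f"[retrieved context for: {q.text[:40]}]"

    def perception_agent(q: Query) -> str:
        # Placeholder: run a remote-sensing model (detector, segmenter) on the image.
        return f"[perception outputs for: {q.image_path}]"

    def reasoning_agent(q: Query, context: list[str]) -> str:
        # Placeholder: compose an LLM prompt from tool outputs and return an answer.
        return f"answer derived from {len(context)} tool outputs"

    def geomm_agent(q: Query) -> str:
        # Plan -> execute tools -> reason, mirroring the three-agent split.
        context = [retrieval_agent(q)]
        if q.image_path is not None:
            context.append(perception_agent(q))
        return reasoning_agent(q, context)

    print(geomm_agent(Query("Which land-cover class dominates this SAR scene?", "scene.tif")))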

If this is right

  • Standalone multimodal models remain limited by missing domain knowledge and weak perceptual grounding in remote sensing data.
  • Strategic insertion of specialized tools and agents can close those gaps on heterogeneous, multi-disciplinary tasks.
  • Comprehensive benchmarks that vary sensors, disciplines, and question types are required to measure real progress toward expert-level capability.
  • Tool-augmented agents become the default architecture for applications that must combine broad scientific knowledge with sensor-specific interpretation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-agent designs with domain tools could be tested in other sensor-heavy fields such as medical imaging or autonomous driving.
  • Developers may need explicit error-recovery mechanisms inside the agent loop to keep performance stable when individual tools misfire (a minimal sketch of one such mechanism follows this list).
  • Public release of the benchmark and agent code would let independent groups measure whether the reported gains hold on new sensors or regions.
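
On the error-recovery point above, a minimal sketch of one such mechanism, assuming a retry-then-fallback policy; call_with_recovery and flaky_detector are hypothetical names, not anything from the paper.

    # Hypothetical retry-then-fallback wrapper for a tool call inside an agent loop.
    from typing import Callable, TypeVar

    T = TypeVar("T")

    def call_with_recovery(tool: Callable[[], T], fallback: Callable[[], T],
                           retries: int = 2) -> T:
        # Retry a flaky tool a few times, then degrade to the tool-free fallback,
        # so one misfiring tool lowers answer quality instead of aborting the run.
        for attempt in range(retries + 1):
            try:
                return tool()
            except Exception as err:  # a real system would catch narrower errors
                print(f"tool failed (attempt {attempt + 1}): {err}")
        return fallback()

    def flaky_detector() -> str:
        raise RuntimeError("detector timeout")

    print(call_with_recovery(flaky_detector,
                             lambda: "base-model answer without detector output"))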

Load-bearing premise

The chosen domain-specific remote sensing models and tools supply reliable, unbiased gains on every task without injecting new errors from tool integration or retrieval failures.

What would settle it

A controlled test in which GeoMMAgent scores lower than the best standalone model on a fresh set of geoscience questions because of tool errors or retrieval failures would falsify the performance claim.
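
A minimal sketch of how such a head-to-head test could be scored, assuming per-question correctness records are available for both systems; the accuracies and counts below are placeholders, not the paper's numbers.

    # Hypothetical paired bootstrap on a fresh question set: is the agent's
    # accuracy gap over the best standalone model reliably positive?
    import random

    random.seed(0)
    n = 500  # fresh geoscience questions (placeholder count)
    agent_correct = [random.random() < 0.72 for _ in range(n)]       # placeholder
    standalone_correct = [random.random() < 0.61 for _ in range(n)]  # placeholder

    def accuracy_gap(idx: list[int]) -> float:
        # Mean per-question difference in correctness over the given indices.
        return sum(agent_correct[i] - standalone_correct[i] for i in idx) / len(idx)

    observed = accuracy_gap(list(range(n)))
    resamples = [accuracy_gap([random.randrange(n) for _ in range(n)])
                 for _ in range(2000)]
    p_loss = sum(g <= 0 for g in resamples) / len(resamples)

    # The claim would be falsified if the gap were reliably <= 0 here.
    print(f"observed gap {observed:+.3f}; bootstrap P(gap <= 0) = {p_loss:.4f}")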

Figures

Figures reproduced from arXiv: 2604.08896 by Aoran Xiao, Hongruixuan Chen, Naoto Yokoya, Shihao Cheng, Yexian Ren, Yonghao Xu.

Figure 1. Expert-level knowledge dimensions in geoscience and …
Figure 2. Examples from GeoMMBench, covering multiple disciplines, diverse sensor modalities, and a wide range of task types. Answer …
Figure 3. Overview of GeoMMAgent, a multi-agent framework that plans, executes, and self-evaluates multimodal tasks for expert-level …
Figure 4. Representative error cases from advanced MLLMs specific to geospatial tasks.
Original abstract

Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning--capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces GeoMMBench, a new multimodal QA benchmark spanning diverse geoscience and remote sensing disciplines, sensors, and tasks. It evaluates 36 open-source and proprietary LLMs on the benchmark, documenting deficiencies in domain knowledge, perceptual grounding, and reasoning. It then presents GeoMMAgent, a multi-agent framework that combines retrieval, perception, and reasoning modules with domain-specific RS models and tools, and reports that this agent significantly outperforms standalone LLMs.

Significance. If the benchmark construction and performance claims are substantiated, the work supplies a needed standardized evaluation resource for multimodal models in remote sensing and provides evidence that tool-augmented multi-agent systems can address limitations of pure LLMs on complex geospatial tasks. The scale of the 36-model evaluation and the explicit focus on heterogeneous sensor modalities are strengths that could influence future domain-specific agent research.

major comments (1)
  1. [Abstract and framework description] The central claim that GeoMMAgent significantly outperforms standalone LLMs rests on the assumption that the integrated domain-specific RS tools and retrieval modules deliver net-positive contributions. The manuscript provides no per-tool accuracy metrics, failure-rate breakdowns, or ablation experiments that disable individual tools or the retrieval module while preserving the agent scaffold, so the reported gains cannot be cleanly attributed to the architecture rather than to the particular tool selection.
minor comments (1)
  1. [Abstract] Details of benchmark construction, data splits, statistical significance testing, and error analysis are absent, which limits immediate assessment of result robustness even though these details may appear later in the manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of GeoMMBench and GeoMMAgent. We address the major comment below with a direct response and commit to revisions that strengthen the attribution of results.

point-by-point responses
  1. Referee: [Abstract and framework description] The central claim that GeoMMAgent significantly outperforms standalone LLMs rests on the assumption that the integrated domain-specific RS tools and retrieval modules deliver net-positive contributions. The manuscript provides no per-tool accuracy metrics, failure-rate breakdowns, or ablation experiments that disable individual tools or the retrieval module while preserving the agent scaffold, so the reported gains cannot be cleanly attributed to the architecture rather than to the particular tool selection.

    Authors: We appreciate the referee's emphasis on rigorous attribution of performance gains. The current manuscript reports end-to-end results on GeoMMBench showing that GeoMMAgent achieves substantially higher accuracy than the 36 evaluated standalone LLMs. However, we acknowledge that the manuscript does not include per-tool accuracy metrics, failure-rate breakdowns, or ablation studies that systematically disable the retrieval module or individual domain-specific RS tools while retaining the multi-agent scaffold. These analyses would indeed allow clearer isolation of each component's contribution. In the revised manuscript we will add targeted ablation experiments (including variants with the retrieval module removed and with specific perception or reasoning tools disabled) together with per-component performance tables and failure analyses. This will directly address the concern and strengthen the evidence that the tool-augmented architecture, rather than tool selection alone, drives the observed improvements. revision: yes
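
A sketch of the kind of ablation loop this commits to, with hypothetical component names; in a real harness, run_agent would route each question through the agent with the listed modules stubbed out and grade the answer.

    # Hypothetical single-component ablation harness: rerun the agent with one
    # module disabled at a time while keeping the scaffold fixed.
    COMPONENTS = ["retrieval", "perception_tools", "reasoning_tools"]

    def run_agent(question: str, disabled: frozenset) -> bool:
        # Placeholder grader: pretend only the full agent answers correctly.
        return len(disabled) == 0

    def ablation_table(questions: list[str]) -> dict:
        # Full agent plus each single-component ablation, mapped to accuracy.
        variants = [frozenset()] + [frozenset({c}) for c in COMPONENTS]
        return {v: sum(run_agent(q, v) for q in questions) / len(questions)
                for v in variants}

    for variant, acc in ablation_table(["q1", "q2", "q3"]).items():
        label = "full agent" if not variant else "without " + ", ".join(sorted(variant))
        print(f"{label}: accuracy {acc:.2f}")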

Circularity Check

0 steps flagged

No circularity: new benchmark and empirical agent evaluation are self-contained

full rationale

The paper creates GeoMMBench as an independent evaluation resource and introduces GeoMMAgent as a tool-augmented multi-agent system, then reports comparative performance numbers on that benchmark. No equations, fitted parameters, or first-principles derivations are present that could reduce to their own inputs by construction. Self-citations, if any, are not load-bearing for the central empirical claim, which rests on fresh experimental results rather than prior author work or renamed known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work relies on standard AI assumptions about tool integration benefits and introduces no fitted parameters or new physical entities; the benchmark and agent are constructed rather than derived from axioms.

axioms (1)
  • domain assumption: Integration of retrieval, perception, and reasoning modules via domain-specific tools improves multimodal performance on geoscience tasks.
    Invoked in the design and claimed superiority of GeoMMAgent.
invented entities (1)
  • GeoMMAgent (multi-agent framework): no independent evidence
    purpose: Strategic integration of retrieval, perception, and reasoning for RS challenges
    Newly proposed system whose performance is demonstrated only through the paper's experiments

pith-pipeline@v0.9.0 · 5491 in / 1214 out tokens · 28605 ms · 2026-05-10T17:54:08.842532+00:00 · methodology


Reference graph

Works this paper leans on

73 extracted references · 18 canonical work pages · 10 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3]

    Choice: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

    Xiao An, Jiaxing Sun, Zihan Gui, and Wei He. Choice: Benchmarking the remote sensing capabilities of large vision-language models. Advances in Neural Information Processing Systems, 2025.

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

  5. [5]

    A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset

    Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, et al. A systematic classification of knowledge, reasoning, and context within the ARC dataset. arXiv preprint arXiv:1806.00358, 2018.

  6. [6]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.

  7. [7]

    Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2024

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, J...

  8. [8]

    NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning

    Qimin Cheng, Haiyan Huang, Yuan Xu, Yuzhuo Zhou, Huanying Li, and Zhongyuan Wang. NWPU-Captions dataset and MLCA-Net for remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 60:1–19, 2022.

  9. [9]

    Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching

    Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, and Tat-Seng Chua. Towards natural language-guided drones: GeoText-1652 benchmark with spatial relation matching. In ECCV, 2024.

  10. [10]

    GeoBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

    Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan. GeoBench-VLM: Benchmarking vision-language models for geospatial tasks. arXiv preprint arXiv:2411.19325, 2024.

  11. [11]

    Introducing Gemini 2.0: Our New AI Model for the Agentic Era

    Google DeepMind. Introducing Gemini 2.0: Our new AI model for the agentic era. Technical report, Google DeepMind, 2024. Accessed: 2025-02-26.

  12. [12]

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM-XComposer2-4KHD: A pioneering large vision-language mo...

  13. [13]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin et al. Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs. arXiv preprint arXiv:2503.01743, 2025.

  14. [14]

    The llama 3 herd of models, 2024

    Aaron Grattafiori et al. The Llama 3 herd of models, 2024.

  15. [15]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

  16. [16]

    Model context protocol (mcp): Landscape, security threats, and future research directions, 2025

    Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (MCP): Landscape, security threats, and future research directions, 2025.

  17. [17]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025.

  18. [18]

    Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024.

  19. [19]

    Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

    Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, and Xiang Bai. Mini-Monkey: Alleviating the semantic sawtooth effect for lightweight MLLMs via complementary image pyramid. arXiv preprint arXiv:2408.02034, 2024.

  20. [20]

    TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

    Jeremy Andrew Irvin, Emily Ruoyu Liu, Joyce Chuyi Chen, Ines Dormoy, Jinyoung Kim, Samar Khanna, Zhuo Zheng, and Stefano Ermon. TEOChat: A large vision-language assistant for temporal earth observation data. In ICLR, 2025.

  21. [21]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.

  22. [22]

    Geochat: Grounded large vision-language model for remote sensing

    Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. GeoChat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831–27840, 2024.

  23. [23]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.

  24. [24]

    Llava-onevision: Easy visual task transfer, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer, 2024.

  25. [25]

    VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

    Xiang Li, Jian Ding, and Mohamed Elhoseiny. VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  26. [26]

    Improved baselines with visual instruction tuning, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2024.

  27. [27]

    LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

  28. [28]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.

  29. [29]

    Nvila: Efficient frontier visual language models, 2024

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. NVILA:...

  30. [30]

    On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances, and Million-AID

    Yang Long, Gui-Song Xia, Shengyang Li, Wen Yang, Michael Ying Yang, Xiao Xiang Zhu, Liangpei Zhang, and Deren Li. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and Million-AID. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:4205–4230, 2021.

  31. [31]

    Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

  32. [32]

    Exploring Models and Data for Remote Sensing Image Caption Generation

    Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017.

  33. [33]

    SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

    Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, and Yansheng Li. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv preprint arXiv:2406.10100, 2024.

  34. [34]

    When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning

    Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, and Yansheng Li. When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9206–9217, 2025.

  35. [35]

    Levels of AGI: Operationalizing Progress on the Path to AGI

    Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Levels of AGI: Operationalizing progress on the path to AGI. arXiv preprint arXiv:2311.02462, 2023.

  36. [36]

    Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

    Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024.

  37. [37]

    GPT-4V System Card

    OpenAI. GPT-4V system card. Technical report, OpenAI.

  38. [39]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. Technical report, OpenAI, 2024. Accessed: 2025-02-26.

  39. [40]

    OpenAI o1 System Card

    OpenAI. OpenAI o1 system card. Technical report, OpenAI. Accessed: 2025-02-26.

  41. [42]

    GPT-5 System Card

    OpenAI. GPT-5 system card. Technical report, OpenAI. Accessed on 2025-11-03.

  43. [44]

    VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

    Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, and Conghui He. VHM: Versatile and honest vision language model for remote sensing image analysis. In AAAI Conference on Artificial Intelligence, 2025.

  44. [45]

    EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

    Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. EarthDial: Turning multi-sensory earth observations to interactive dialogues. arXiv preprint arXiv:2412.15190, 2024.

  45. [46]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

  46. [47]

    Qwen2.5-vl, 2025

    Qwen Team. Qwen2.5-VL, 2025.

  47. [48]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025.

  48. [49]

    XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

    Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. XLRS-Bench: Could your multimodal LLMs understand extremely large ultra-high-resolution remote sensing imagery? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336, 2025.

  49. [50]

    LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation

    Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv preprint arXiv:2110.08733, 2021.

  50. [51]

    EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering

    Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5481–5489, 2024.

  51. [52]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution, 2024.

  52. [53]

    Cogvlm: Visual expert for pretrained language models, 2024

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models, 2024.

  53. [54]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

  54. [55]

    SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

    Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5805–5813, 2024.

  55. [56]

    SARLang-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

    Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, and Naoto Yokoya. SARLang-1M: A benchmark for vision-language modeling in SAR image understanding. IEEE Transactions on Geoscience and Remote Sensing, 2026.

  56. [57]

    Dota: A large-scale dataset for object detection in aerial images

    Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

  57. [58]

    Foundation models for remote sensing and earth observation: A survey

    Aoran Xiao, Weihao Xuan, Junjue Wang, Jiaxing Huang, Dacheng Tao, Shijian Lu, and Naoto Yokoya. Foundation models for remote sensing and earth observation: A survey. IEEE Geoscience and Remote Sensing Magazine, 2025.

  58. [59]

    LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  59. [60]

    Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval

    Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang, and Xian Sun. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 60:3078451, 2022.

  60. [61]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

  61. [62]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024.

  62. [63]

    SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

    Yang Zhan, Zhitong Xiong, and Yuan Yuan. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77, 2025.

  63. [64]

    Good at captioning, bad at counting: Benchmarking gpt-4v on earth observation data

    Chenhui Zhang and Sherrie Wang. Good at captioning, bad at counting: Benchmarking GPT-4V on earth observation data. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 7839–7849. IEEE, 2024.

  64. [65]

    LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Reality check on the evaluation of large multimodal models, 2024.

  65. [66]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

    Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.

  66. [67]

    Gme: Improving universal multimodal retrieval by multimodal llms, 2025

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: Improving universal multimodal retrieval by multimodal LLMs, 2025.

  67. [68]

    Agieval: A human-centric benchmark for evaluating foundation models, 2023

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.

  68. [69]

    Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, Dong Liu, and Feng Zhao. SkySense-O: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 14733–14744, 2025.

  69. [70]

    Dimensions and Tasks in GeoMMBench

    More Descriptions on GeoMMBench, 7.1. Dimensions and Tasks in GeoMMBench: Below we provide explanations for the abbreviations of evaluation dimensions in GeoMMBench, as listed in Tables 1 and 3 of the paper, along with their corresponding tasks. Disciplines: “RS” (Remote Sensing), “Ph.” (Photogrammetry), “GIS” (Geographic Information System), and “GNSS” (G...

  70. [71]

    Toolkit Library: the tools integrated into GeoMMAgent

    More Descriptions on GeoMMAgent, 8.1. Toolkit Library: We present the tools integrated into GeoMMAgent. As shown in Fig. 4 and Section 3 of the manuscript, the toolkit library is organized into four categories: general toolkit, knowledge toolkit, perception toolkit, and reasoning toolkit. GeoMMAgent is designed as a fully training-free and extensible framework...

  71. [72]

    The model recognizes 51 scene categories and land cover types, covering the major classes commonly used in remote sensing scene understanding

    dataset. The model recognizes 51 scene categories and land cover types, covering the major classes commonly used in remote sensing scene understanding. The toolkit outputs top five predictions with confidence scores to support precise interpretation of scene semantics. • Detection model: We deploy a pre-trained Yolo11 detector [21] with backbone CSPNet...

  72. [73]

    Band ① properties

    dataset. It employs oriented bounding boxes to detect and localize diverse geospatial objects such as aircraft, vehicles, and buildings. The toolkit outputs object counts, spatial distributions, and detection reports that include class labels and confidence values. • Segmentation model: We train a DeepLabv3 plus model with Xception backbone [6] on th...

  73. [74]

    Limitation: Like any benchmark, GeoMMBench has limitations despite its comprehensive design. The manual curation process may introduce selection biases, and the chosen knowledge points, while diverse, cannot fully represent the complete breadth and depth required for evaluating an Expert AGI in geoscience and remote sensing. Even so, we argue that strong p...