Recognition: no theorem link
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
A multi-agent framework with domain-specific remote sensing tools enables large language models to outperform standalone versions on complex geoscience tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoMMBench exposes systematic deficiencies in current multimodal large language models when faced with the breadth of disciplinary knowledge, sensor modalities, and task variety in geoscience and remote sensing. GeoMMAgent counters these deficiencies by orchestrating multiple agents that integrate retrieval of domain knowledge, perception via specialized remote-sensing models, and step-by-step reasoning, thereby achieving significantly higher performance than any standalone large language model on the same benchmark.
What carries the argument
GeoMMAgent, a multi-agent framework that routes tasks across retrieval, perception, and reasoning agents equipped with domain-specific remote sensing models and tools.
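The routing idea can be sketched in a few lines. The paper does not publish its code here, so every name below is a hypothetical illustration of a retrieval, perception, and reasoning flow sharing one context, not GeoMMAgent's actual implementation:

```python
# Hypothetical sketch of a GeoMMAgent-style routing loop.
# All names are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    question: str
    needs: set  # capabilities the router believes the task requires

@dataclass
class Agent:
    name: str
    capability: str
    run: Callable[["Task", dict], dict]

def route(task: Task, agents: list) -> dict:
    """Pass the task through every agent whose capability it needs,
    accumulating a shared context (retrieved facts, perception outputs,
    reasoning steps)."""
    context = {}
    for agent in agents:
        if agent.capability in task.needs:
            context.update(agent.run(task, context))
    return context

# Illustrative stand-ins for the retrieval, perception, and reasoning agents.
agents = [
    Agent("retriever", "knowledge",
          lambda t, c: {"facts": f"domain notes for: {t.question}"}),
    Agent("perceiver", "perception",
          lambda t, c: {"objects": ["aircraft", "runway"]}),
    Agent("reasoner", "reasoning",
          lambda t, c: {"answer": f"derived from {sorted(c)}"}),
]

result = route(Task("How many aircraft are parked?",
                    {"knowledge", "perception", "reasoning"}), agents)
```

The design choice worth noting is the shared context dict: each agent only reads what earlier agents wrote, which is what lets a tool-augmented scaffold ground the final reasoning step in retrieved knowledge and perception outputs.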
If this is right
- Standalone multimodal models remain limited by missing domain knowledge and weak perceptual grounding in remote sensing data.
- Strategic insertion of specialized tools and agents can close those gaps on heterogeneous, multi-disciplinary tasks.
- Comprehensive benchmarks that vary sensors, disciplines, and question types are required to measure real progress toward expert-level capability.
- Tool-augmented agents become the default architecture for applications that must combine broad scientific knowledge with sensor-specific interpretation.
Where Pith is reading between the lines
- Similar multi-agent designs with domain tools could be tested in other sensor-heavy fields such as medical imaging or autonomous driving.
- Developers may need explicit error-recovery mechanisms inside the agent loop to keep performance stable when individual tools misfire.
- Public release of the benchmark and agent code would let independent groups measure whether the reported gains hold on new sensors or regions.
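The error-recovery point above can be made concrete with a small wrapper: if an individual tool misfires, the agent loop retries and then degrades to a fallback instead of crashing. This is a generic retry-with-fallback sketch with invented names, not a mechanism described in the paper:

```python
# Generic retry-with-fallback guard for tool calls (illustrative only).
def with_recovery(tool, fallback, retries=2):
    """Wrap a tool so that up to `retries` failures are retried and a
    final failure returns `fallback` instead of raising."""
    def guarded(*args, **kwargs):
        for attempt in range(retries + 1):
            try:
                return tool(*args, **kwargs)
            except Exception:
                if attempt == retries:
                    return fallback
    return guarded

# A tool that misfires twice before succeeding, to exercise the guard.
calls = {"n": 0}
def flaky_detector(image):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("tool misfire")
    return {"objects": ["ship"]}

detect = with_recovery(flaky_detector, fallback={"objects": []})
result = detect("scene.tif")  # {'objects': ['ship']} after two recovered misfires
```

A guard like this keeps agent performance stable under transient tool failures, at the cost of masking errors, so a real system would also log each recovered failure for the kind of failure-rate analysis a benchmark report needs.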
Load-bearing premise
The chosen domain-specific remote sensing models and tools supply reliable, unbiased gains on every task without injecting new errors from tool integration or retrieval failures.
What would settle it
A controlled test in which GeoMMAgent scores lower than the best standalone model on a fresh set of geoscience questions because of tool errors or retrieval failures would falsify the performance claim.
Original abstract
Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning, capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeoMMBench, a new multimodal QA benchmark spanning diverse geoscience and remote sensing disciplines, sensors, and tasks. It evaluates 36 open-source and proprietary LLMs on the benchmark, documenting deficiencies in domain knowledge, perceptual grounding, and reasoning. It then presents GeoMMAgent, a multi-agent framework that combines retrieval, perception, and reasoning modules with domain-specific RS models and tools, and reports that this agent significantly outperforms standalone LLMs.
Significance. If the benchmark construction and performance claims are substantiated, the work supplies a needed standardized evaluation resource for multimodal models in remote sensing and provides evidence that tool-augmented multi-agent systems can address limitations of pure LLMs on complex geospatial tasks. The scale of the 36-model evaluation and the explicit focus on heterogeneous sensor modalities are strengths that could influence future domain-specific agent research.
major comments (1)
- Abstract and framework description: The central claim that GeoMMAgent significantly outperforms standalone LLMs rests on the assumption that the integrated domain-specific RS tools and retrieval modules deliver net-positive contributions. The manuscript provides no per-tool accuracy metrics, failure-rate breakdowns, or ablation experiments that disable individual tools or the retrieval module while preserving the agent scaffold, preventing clear attribution of gains to the architecture versus tool selection.
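The ablation the referee requests could be organized as a simple harness that disables one component at a time while holding the scaffold fixed. The scorer below is a runnable stand-in with invented numbers, not results from the paper:

```python
# Hypothetical ablation harness: score the agent with each component
# disabled in turn, so gains can be attributed to components rather
# than to the scaffold as a whole. All names and numbers are invented.
def accuracy(components, benchmark=None):
    """Stand-in scorer: each enabled component adds a fixed boost over a
    base accuracy, just to make the harness runnable. A real version
    would run the agent on the benchmark with these components enabled."""
    base = 0.40
    boost = {"retrieval": 0.10, "perception": 0.15, "reasoning": 0.12}
    return base + sum(boost[c] for c in components)

FULL = {"retrieval", "perception", "reasoning"}

def ablation_table(benchmark=None):
    rows = {"full": accuracy(FULL, benchmark)}
    for component in sorted(FULL):
        rows[f"-{component}"] = accuracy(FULL - {component}, benchmark)
    return rows

table = ablation_table()
# The drop from "full" to "-perception" estimates perception's contribution.
```

Reporting such a table alongside per-tool failure rates would separate the contribution of the architecture from that of tool selection, which is exactly the attribution gap the comment identifies.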
minor comments (1)
- Abstract: The description of benchmark construction, data splits, statistical significance testing, and error analysis is absent, which limits immediate assessment of result robustness even though these details may appear later in the manuscript.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential value of GeoMMBench and GeoMMAgent. We address the major comment below with a direct response and commit to revisions that strengthen the attribution of results.
Point-by-point responses
- Referee (Abstract and framework description): The central claim that GeoMMAgent significantly outperforms standalone LLMs rests on the assumption that the integrated domain-specific RS tools and retrieval modules deliver net-positive contributions. The manuscript provides no per-tool accuracy metrics, failure-rate breakdowns, or ablation experiments that disable individual tools or the retrieval module while preserving the agent scaffold, preventing clear attribution of gains to the architecture versus tool selection.
Authors: We appreciate the referee's emphasis on rigorous attribution of performance gains. The current manuscript reports end-to-end results on GeoMMBench showing that GeoMMAgent achieves substantially higher accuracy than the 36 evaluated standalone LLMs. However, we acknowledge that the manuscript does not include per-tool accuracy metrics, failure-rate breakdowns, or ablation studies that systematically disable the retrieval module or individual domain-specific RS tools while retaining the multi-agent scaffold. These analyses would indeed allow clearer isolation of each component's contribution. In the revised manuscript we will add targeted ablation experiments (including variants with the retrieval module removed and with specific perception or reasoning tools disabled) together with per-component performance tables and failure analyses. This will directly address the concern and strengthen the evidence that the tool-augmented architecture, rather than tool selection alone, drives the observed improvements.
Revision: yes
Circularity Check
No circularity: new benchmark and empirical agent evaluation are self-contained
full rationale
The paper creates GeoMMBench as an independent evaluation resource and introduces GeoMMAgent as a tool-augmented multi-agent system, then reports comparative performance numbers on that benchmark. No equations, fitted parameters, or first-principles derivations are present that could reduce to their own inputs by construction. Self-citations, if any, are not load-bearing for the central empirical claim, which rests on fresh experimental results rather than prior author work or renamed known patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Integration of retrieval, perception, and reasoning modules via domain-specific tools improves multimodal performance on geoscience tasks.
invented entities (1)
- GeoMMAgent multi-agent framework (no independent evidence)
From the paper's appendix
- GeoMMBench dimensions and tasks: discipline abbreviations are "RS" (Remote Sensing), "Ph." (Photogrammetry), "GIS" (Geographic Information System), and "GNSS" (Global Navigation Satellite System) …
- GeoMMAgent toolkit library: the tools are organized into four categories (general toolkit, knowledge toolkit, perception toolkit, and reasoning toolkit); GeoMMAgent is designed as a fully training-free and extensible framework …
- Perception tools: a scene-classification model recognizing 51 scene categories and land-cover types, outputting the top five predictions with confidence scores; a pre-trained YOLOv11 detector that uses oriented bounding boxes to detect and localize diverse geospatial objects such as aircraft, vehicles, and buildings, reporting object counts, spatial distributions, class labels, and confidence values; and a DeepLabv3+ segmentation model with an Xception backbone …
- Limitation: like any benchmark, GeoMMBench has limitations despite its comprehensive design. The manual curation process may introduce selection biases, and the chosen knowledge points, while diverse, cannot fully represent the complete breadth and depth required for evaluating an Expert AGI in geoscience and remote sensing …