RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3
The pith
RemoteAgent uses reinforcement learning on a dataset of simulated vague queries to align an MLLM into a cognitive core that directly handles image-level and sparse region-level Earth Observation tasks while routing only dense predictions to external tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RemoteAgent is an agentic framework that respects the intrinsic capability boundaries of MLLMs. By constructing VagueEO and using it for reinforcement fine-tuning, the system turns an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks, while orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions.
What carries the argument
VagueEO-guided reinforcement fine-tuning that aligns an MLLM into a cognitive core able to decide between internal resolution and tool orchestration via the Model Context Protocol.
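The routing behavior this describes can be sketched as a simple dispatch policy. The task taxonomy, function names, and string outputs below are hypothetical placeholders for illustration, not the paper's actual interface:

```python
# Minimal sketch of the cognitive-core routing idea: the MLLM classifies
# a vague query into a task granularity, answers image- and sparse
# region-level tasks itself, and emits an external MCP tool call only
# for dense (pixel-level) predictions. All names here are invented.

DENSE_TASKS = {"semantic_segmentation", "change_detection_mask"}

def route(predicted_task: str, query: str) -> str:
    """Decide whether the core model answers directly or delegates."""
    if predicted_task in DENSE_TASKS:
        return f"mcp_tool_call({predicted_task!r}, {query!r})"
    return f"internal_answer({query!r})"

print(route("scene_classification", "what is in this area?"))
print(route("semantic_segmentation", "outline every building"))
```

The point of the sketch is that delegation is conditioned on predicted granularity rather than applied indiscriminately.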
If this is right
- RemoteAgent demonstrates robust recognition of vague human intents in Earth Observation scenarios.
- The framework achieves competitive performance on a range of image-, region-, and pixel-level EO tasks.
- Suitable tasks are processed internally, limiting tool calls to only those requiring dense spatial output.
- The Model Context Protocol enables selective orchestration of specialized tools without indiscriminate invocation.
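For readers unfamiliar with it, the Model Context Protocol exposes tools to a model over JSON-RPC 2.0. A dense-prediction delegation might look roughly like the request below; the tool name and arguments are invented for illustration, not taken from the paper:

```python
import json

# Hypothetical MCP tools/call request the cognitive core might emit when
# a query requires a dense pixel-level prediction. MCP messages follow
# JSON-RPC 2.0; the tool name and arguments here are assumptions.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "segment_image",  # hypothetical dense-prediction tool
        "arguments": {"image_uri": "eo://scene_42", "classes": ["building"]},
    },
}
print(json.dumps(request, indent=2))
```

Selective orchestration then amounts to emitting such a request only when the routing decision calls for dense output.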
Where Pith is reading between the lines
- The same reinforcement approach for learning self-limits could apply to MLLM agents in medical imaging or autonomous driving where query precision varies.
- Replacing simulated queries with logs of actual user interactions would test whether the learned boundaries generalize beyond the training distribution.
- Extending the protocol to more tool types might allow the core model to stay small while covering additional precision levels.
- The design illustrates a broader pattern: train models first to recognize what they cannot do well before building surrounding agent infrastructure.
Load-bearing premise
The VagueEO dataset of simulated vague queries accurately captures real user intents, and reinforcement learning can reliably teach the MLLM its own capability boundaries without post-hoc adjustments.
What would settle it
Run RemoteAgent on a fresh collection of real human-provided vague EO queries and measure whether its rate of correct internal handling and its overall task accuracy exceed those of both a version that always delegates to tools and a version that always answers directly.
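The proposed test can be phrased as a small evaluation harness comparing a learned router against the two degenerate policies. The examples, the `gold_needs_tool` labels, and the toy router below are illustrative stand-ins, since the paper's actual evaluation protocol is unspecified:

```python
# Sketch of the settling experiment: score each policy by how often its
# delegation decision matches a gold label on real vague queries.
# The data and the keyword-based "learned" router are assumptions.

def evaluate(policy, examples):
    """Fraction of examples where the policy's delegation choice is right."""
    correct = sum(1 for ex in examples if policy(ex) == ex["gold_needs_tool"])
    return correct / len(examples)

examples = [
    {"task": "count ships roughly", "gold_needs_tool": False},
    {"task": "exact flood extent mask", "gold_needs_tool": True},
    {"task": "describe land use", "gold_needs_tool": False},
]

always_delegate = lambda ex: True
always_direct = lambda ex: False
learned_router = lambda ex: "mask" in ex["task"] or "extent" in ex["task"]

for name, policy in [("always_delegate", always_delegate),
                     ("always_direct", always_direct),
                     ("learned_router", learned_router)]:
    print(name, evaluate(policy, examples))
```

On this toy data the learned router wins; the open question is whether that ordering holds on authentic expert queries.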
Original abstract
Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RemoteAgent, an agentic framework for Multi-modal Large Language Models (MLLMs) in Earth Observation (EO). It introduces the VagueEO dataset, constructed by pairing EO tasks with simulated vague natural-language queries, and applies reinforcement fine-tuning to align an MLLM as a cognitive core. The framework claims to respect intrinsic MLLM capability boundaries by handling image- and sparse region-level tasks internally while delegating only dense predictions to specialized tools via the Model Context Protocol (MCP), achieving robust intent recognition and competitive performance on diverse EO tasks.
Significance. If the claims hold, this approach could meaningfully improve efficiency in EO systems by reducing unnecessary external tool calls and better exploiting native MLLM strengths for multi-granularity analysis. The VagueEO dataset and RL-based boundary alignment represent a concrete contribution to human-centric agentic designs in remote sensing, with potential applicability to other domains requiring selective delegation.
Major comments (2)
- [Abstract and §3] Abstract and §3 (VagueEO Construction): The dataset is built by 'pairing EO tasks with simulated vague natural-language queries,' yet the manuscript supplies no quantitative validation (e.g., distributional statistics on temporal qualifiers, sensor references, or precision tolerances) showing that these simulations match real expert query distributions. This assumption is load-bearing for the central claim that RL fine-tuning produces reliable self-assessment of capability boundaries; divergence would cause the learned policy to misclassify tasks on authentic inputs.
- [§5] §5 (Experiments): The abstract asserts that 'extensive experiments demonstrate robust intent recognition capabilities while delivering highly competitive performance,' but the text provides no specific metrics, baselines, error bars, ablation studies, or details on how intent recognition accuracy versus task performance was quantified across EO scenarios. This prevents independent verification of the robustness and competitiveness claims.
Minor comments (2)
- [§2] The Model Context Protocol (MCP) is referenced without an initial definition or citation; a short explanatory sentence or pointer to its specification would aid readers unfamiliar with the term.
- [Figures] Ensure any workflow diagrams clearly label the decision points where the MLLM elects internal processing versus MCP tool invocation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and indicate planned revisions.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (VagueEO Construction): The dataset is built by 'pairing EO tasks with simulated vague natural-language queries,' yet the manuscript supplies no quantitative validation (e.g., distributional statistics on temporal qualifiers, sensor references, or precision tolerances) showing that these simulations match real expert query distributions. This assumption is load-bearing for the central claim that RL fine-tuning produces reliable self-assessment of capability boundaries; divergence would cause the learned policy to misclassify tasks on authentic inputs.
Authors: We agree that explicit validation of the simulated queries is important to support the RL fine-tuning claims. The VagueEO queries were generated by EO domain experts to reflect typical vagueness, but the initial manuscript omitted comparative statistics. In revision we will add to §3 quantitative distributional comparisons (e.g., frequencies of temporal qualifiers, sensor references, and precision terms) against a held-out set of real expert queries, along with any noted limitations. revision: yes
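One minimal form of the promised check is a frequency comparison between simulated and real query sets. The qualifier categories, keyword lists, and total-variation statistic below are illustrative choices, not the authors' protocol:

```python
from collections import Counter

# Hypothetical qualifier vocabulary; the paper's actual categories and
# keyword lists are not specified in the manuscript.
QUALIFIERS = {
    "temporal": ("recent", "last", "before", "after"),
    "sensor": ("sar", "optical", "infrared"),
    "precision": ("exact", "roughly", "approximate"),
}

def qualifier_profile(queries):
    """Relative frequency of each qualifier category across a query set."""
    counts = Counter()
    for q in queries:
        text = q.lower()
        for cat, words in QUALIFIERS.items():
            if any(w in text for w in words):
                counts[cat] += 1
    total = sum(counts.values()) or 1
    return {cat: counts[cat] / total for cat in QUALIFIERS}

def tv_distance(p, q):
    """Total variation distance between two category profiles."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in QUALIFIERS)

simulated = ["show recent flooding", "roughly how many ships?"]
real = ["exact flood extent from sar", "recent construction?"]
print(tv_distance(qualifier_profile(simulated), qualifier_profile(real)))
```

A small distance would lend support to the representativeness premise; a large one would flag exactly the distribution gap the referee worries about.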
Referee: [§5] §5 (Experiments): The abstract asserts that 'extensive experiments demonstrate robust intent recognition capabilities while delivering highly competitive performance,' but the text provides no specific metrics, baselines, error bars, ablation studies, or details on how intent recognition accuracy versus task performance was quantified across EO scenarios. This prevents independent verification of the robustness and competitiveness claims.
Authors: We acknowledge the experimental section lacked sufficient detail for independent verification. We will revise §5 to include concrete metrics (intent recognition accuracy and F1, task success rates), explicit baselines, error bars from repeated runs, ablation studies on RL and MCP components, and a clear protocol separating intent recognition evaluation from downstream task performance. revision: yes
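The promised error bars from repeated runs can be illustrated with a tiny bootstrap over per-example correctness flags. The data below is synthetic and the interval construction is one common choice, not the authors' stated method:

```python
import random

def bootstrap_ci(correct_flags, n_boot=2000, seed=0):
    """95% percentile bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(correct_flags)
    means = sorted(
        sum(rng.choice(correct_flags) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Synthetic per-query intent-recognition outcomes (1 = correct).
flags = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
acc = sum(flags) / len(flags)
lo, hi = bootstrap_ci(flags)
print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting an interval of this kind alongside the point estimate is what would make the "robust intent recognition" claim independently checkable.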
Circularity Check
No significant circularity
Full rationale
The paper constructs VagueEO by pairing EO tasks with simulated vague queries, then applies standard RL fine-tuning to align an MLLM for deciding internal handling versus tool delegation via Model Context Protocol. No equations, derivations, or self-referential definitions appear in the abstract or described claims that reduce any prediction or result to its inputs by construction. The approach relies on established RL/MLLM methods plus a new dataset without load-bearing self-citations, imported uniqueness theorems, or smuggled ansatzes. Central claims about intent recognition and capability boundaries remain independent of the inputs and are presented as empirically validated rather than tautological.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: MLLMs possess learnable intrinsic capability boundaries separating tasks they can resolve internally from those requiring external tools.
- Domain assumption: the VagueEO dataset of simulated vague queries paired with EO tasks is representative of real human intents.
Invented entities (2)
- RemoteAgent: no independent evidence
- VagueEO: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks... exclusively for dense predictions."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "we utilize VagueEO for reinforcement fine-tuning... GRPO... unified multimodal reward"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Evaluating Remote Sensing Image Captions Beyond Metric Biases
  Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
- RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
  RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024. 6
work page internal anchor Pith review arXiv 2024
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 4, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Whu-rs19 abzsl: An attribute-based dataset for remote sens- ing image understanding
Mattia Balestra, Marina Paolanti, and Roberto Pierdicca. Whu-rs19 abzsl: An attribute-based dataset for remote sens- ing image understanding. Remote Sensing, 17(14):2384,
-
[4]
Geoflow: Agentic workflow automation for geospatial tasks
Amulya Bhattaram, Justin Chung, Stanley Chung, Ranit Gupta, Janani Ramamoorthy, Kartikeya Gullapalli, Diana Marculescu, and Dimitrios Stamoulis. Geoflow: Agentic workflow automation for geospatial tasks. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems, pages 1150–1153, 2025. 2, 9
2025
-
[5]
Dual-tasks siamese transformer framework for building damage assessment
Hongruixuan Chen, Edoardo Nemni, Sofia Vallecorsa, Xi Li, Chen Wu, and Lars Bromley. Dual-tasks siamese transformer framework for building damage assessment. In IGARSS 2022-2022 IEEE international geoscience and remote sensing symposium, pages 1600–1603. IEEE, 2022. 8
2022
-
[6]
Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou, Peter M Atkinson, and Bing Zhang. Cangling- knowflow: A unified knowledge-and-flow-fused agent for comprehensive remote sensing applications. arXiv preprint arXiv:2512.15231, 2025. 2, 9
-
[7]
Anchor-free oriented proposal generator for object detection
Gong Cheng, Jiabao Wang, Ke Li, Xingxing Xie, Chunbo Lang, Yanqing Yao, and Junwei Han. Anchor-free oriented proposal generator for object detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022. 7
2022
-
[8]
Fuse-rsvlm: Feature fusion vision-language model for remote sensing
Yunkai Dang, Donghao Wang, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, and Yang Gao. Fuse-rsvlm: Feature fusion vision-language model for remote sensing. arXiv preprint arXiv:2512.24022,
-
[9]
ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents
Shimin Di, Xujie Yuan, Hanghui Guo, Chaoqian Ouyang, Zhangze Chen, Ling Yue, Libin Zheng, Jia Zhu, Shaowu Pan, Jian Yin, et al. Toolrosetta: Bridging open-source repos- itories and large language model agents through automated tool standardization. arXiv preprint arXiv:2603.09290, 2026. 2
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Cross-modal bidi- rectional interaction model for referring remote sensing image segmentation,
Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, and Yanfeng Gu. Cross-modal bidirectional interaction model for referring remote sensing image segmentation. arXiv preprint arXiv:2410.08613, 2024. 8
-
[11]
Earth-agent: Unlocking the full landscape of earth observation with agents,
Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xin- jie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, et al. Earth-agent: Unlocking the full landscape of earth observation with agents. arXiv preprint arXiv:2509.23141, 2025. 2, 8, 9
-
[12]
Geovlm-r1: Reinforcement fine-tuning for improved remote sensing reasoning
Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, and Salman Khan. Geovlm-r1: Reinforcement fine-tuning for improved remote sensing reasoning. arXiv preprint arXiv:2509.25026, 2025. 9
-
[13]
Skysense: A multi-modal remote sens- ing foundation model towards universal interpretation for earth observation imagery
Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, et al. Skysense: A multi-modal remote sens- ing foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27672–27683, 2024. 7
2024
-
[14]
Model context protocol (mcp): Landscape, security threats, and future research directions
Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions. ACM Transactions on Software Engineering and Methodology, 2025. 2
2025
-
[15]
Rs-vheat: Heat conduc- tion guided efficient remote sensing foundation model
Huiyang Hu, Peijin Wang, Hanbo Bi, Boyuan Tong, Zhaozhi Wang, Wenhui Diao, Hao Chang, Yingchao Feng, Ziqi Zhang, Yaowei Wang, et al. Rs-vheat: Heat conduc- tion guided efficient remote sensing foundation model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9876–9887, 2025. 7
2025
-
[16]
Rsgpt: A remote sensing vision language model and benchmark
Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 1, 9
2025
-
[17]
Teochat: A large vision-language as- sistant for temporal earth observation data
Jeremy Andrew Irvin, Emily Ruoyu Liu, Joyce Chuyi Chen, Ines Dormoy, Jinyoung Kim, Samar Khanna, Zhuo Zheng, and Stefano Ermon. Teochat: A large vision-language as- sistant for temporal earth observation data. In International Conference on Learning Representations, 2025. 9
2025
-
[18]
Eaglevision: Object-level attribute multimodal llm for remote sensing, 2025
Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, and Guo Chen. Eaglevision: Object-level attribute multimodal llm for remote sensing, 2025. 9
2025
-
[19]
Falcon: A remote sensing vision-language foundation model.arXiv preprint arXiv:2503.11070, 2025
Yao kelu, Xu Nuo, Yang Rong, Xu Yingying, Gao Zhuoyan, Kitrungrotsakul Titinunt, Ren yi, Zhang Pu, Wang Jin, Wei Ning, and Li Chao. Falcon: A remote sens- ing vision-language foundation model. arXiv preprint arXiv:2503.11070, 2025. 1, 6, 7, 9
-
[20]
Geochat: Grounded large vision-language model for remote sensing
Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27831– 27840, 2024. 1, 6, 7, 9
2024
-
[21]
Lisa: Reasoning seg- 10 mentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning seg- 10 mentation via large language model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024. 8
2024
-
[22]
Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, and Song Bai. Text4seg++: Advancing image segmen- tation via generative language modeling. arXiv preprint arXiv:2509.06321, 2025. 8
-
[23]
Object detection in optical remote sensing im- ages: A survey and a new benchmark
Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Jun- wei Han. Object detection in optical remote sensing im- ages: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020. 7
2020
-
[24]
Unleashing channel potential: Space- frequency selection convolution for sar object detection
Ke Li, Di Wang, Zhangyuan Hu, Wenxuan Zhu, Shaofeng Li, and Quan Wang. Unleashing channel potential: Space- frequency selection convolution for sar object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17323–17332, 2024. 1
2024
-
[25]
Language-guided progressive attention for visual grounding in remote sensing images
Ke Li, Di Wang, Haojie Xu, Haodi Zhong, and Cong Wang. Language-guided progressive attention for visual grounding in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. 9
2024
-
[26]
Designing domain- specific agents via hierarchical task abstraction mechanism
Kaiyu Li, Jiayu Wang, Zhi Wang, Hui Qiao, Weizhan Zhang, Deyu Meng, and Xiangyong Cao. Designing domain- specific agents via hierarchical task abstraction mechanism. arXiv preprint arXiv:2511.17198, 2025. 9
-
[27]
Segearth-r1: Geospatial pixel reasoning via large language model
Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangy- ong Cao. Segearth-r1: Geospatial pixel reasoning via large language model. arXiv preprint arXiv:2504.09644, 2025. 7, 8, 9
-
[28]
Rsvg-zeroov: Exploring a training-free framework for zero- shot open-vocabulary visual grounding in remote sensing im- ages
Ke Li, Di Wang, Ting Wang, Fuyu Dong, Yiming Zhang, Luyao Zhang, Xiangyu Wang, Shaofeng Li, and Quan Wang. Rsvg-zeroov: Exploring a training-free framework for zero- shot open-vocabulary visual grounding in remote sensing im- ages. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6288–6296, 2026. 1
2026
-
[29]
Georeason: Aligning thinking and answering in remote sensing vision-language models via logical consis- tency reinforcement learning, 2026
Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, and Yuxin Hu. Georeason: Aligning thinking and answering in remote sensing vision-language models via logical consis- tency reinforcement learning, 2026. 9
2026
-
[30]
Masked angle-aware autoen- coder for remote sensing images
Zhihao Li, Biao Hou, Siteng Ma, Zitong Wu, Xianpeng Guo, Bo Ren, and Licheng Jiao. Masked angle-aware autoen- coder for remote sensing images. In European Conference on Computer Vision, pages 260–278. Springer, 2024. 7
2024
-
[31]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 7
2023
-
[32]
Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, and Bo Yang. Sky- moe: A vision-language foundation model for enhancing geospatial interpretation with mixture of experts. arXiv preprint arXiv:2512.02517, 2025. 6, 7
-
[33]
Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. To- wards faithful reasoning in remote sensing: A perceptually- grounded geospatial chain-of-thought for vision-language models. arXiv preprint arXiv:2509.22221, 2025. 9
-
[34]
Rotated multi-scale interaction network for referring remote sensing image seg- mentation
Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Ji- ayi Ji, Xiaoshuai Sun, and Rongrong Ji. Rotated multi-scale interaction network for referring remote sensing image seg- mentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26658– 26668, 2024. 8
2024
-
[35]
Xu Liu and Zhouhui Lian. Rsunivlm: A unified vision language model for remote sensing via granularity-oriented mixture of experts. arXiv preprint arXiv:2412.05679, 2024. 1, 7, 9
-
[36]
Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, et al. Skysensegpt: A fine-grained in- struction tuning dataset and model for remote sensing vision- language understanding. arXiv preprint arXiv:2406.10100,
-
[37]
Ge- omag: A vision-language model for pixel-level fine-grained remote sensing image parsing
Xianzhi Ma, Jianhui Li, Changhua Pei, and Hao Liu. Ge- omag: A vision-language model for pixel-level fine-grained remote sensing image parsing. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5441– 5450, 2025. 6, 8
2025
-
[38]
Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model
Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024. 1, 6, 7
2024
-
[39]
Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Linan Yue, Shaowu Pan, Jian Yin, and Min-Ling Zhang. Code2mcp: Transforming code repositories into mcp ser- vices. arXiv preprint arXiv:2509.05941, 2025. 2
-
[40]
Vhm: Versatile and honest vision language model for remote sensing image analysis
Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Ji- axing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6381– 6388, 2025. 6, 7
2025
-
[41]
Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parame- ters
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parame- ters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020. 6
2020
-
[42]
Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning
Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brock- man, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088– 4099, 2023. 7
2023
-
[43]
Pixellm: Pixel reasoning with large multimodal model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024. 8
2024
-
[44]
Rs2- sam2: Customized sam2 for referring remote sensing image segmentation
Fu Rong, Meng Lan, Qian Zhang, and Lefei Zhang. Rs2- sam2: Customized sam2 for referring remote sensing image segmentation. arXiv preprint arXiv:2503.07266, 2025. 8 11
-
[45]
ThinkGeo: Evaluating tool-augmented agents for remote sensing tasks,
Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dud- hane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, and Salman Khan. Thinkgeo: Evaluating tool- augmented agents for remote sensing tasks. arXiv preprint arXiv:2505.23752, 2025. 9
-
[46]
arXiv preprint arXiv:2501.13925 URL:https://arxiv.org/abs/2501.13925
Akashah Shabbir, Mohammed Zumri, Mohammed Ben- namoun, Fahad S Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing. arXiv preprint arXiv:2501.13925, 2025. 8
-
[47]
Openearthagent: A unified framework for tool-augmented geospatial agents,
Akashah Shabbir, Muhammad Umer Sheikh, Muham- mad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muham- mad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, et al. Openeartha- gent: A unified framework for tool-augmented geospatial agents. arXiv preprint arXiv:2602.17665, 2026. 2, 9
-
[48]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Fully convolutional networks for dense se- mantic labelling of high-resolution aerial imagery
Jamie Sherrah. Fully convolutional networks for dense se- mantic labelling of high-resolution aerial imagery. arXiv preprint arXiv:1606.02585, 2016. 7
-
[50]
Earthdial: Turning multi-sensory earth observations to interactive dialogues
Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fa- had Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, ...
2025
-
[51]
Geospatial foundation models: Recent advances and applications
Ranga Raju Vatsavai. Geospatial foundation models: Recent advances and applications. In Proceedings of the 12th ACM SIGSPATIALInternational Workshop on Analytics for Big Geospatial Data, pages 30–33, 2024. 7
2024
-
[52]
Geozero: Incentivizing reasoning from scratch on geospatial scenes, 2025
Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. Geozero: Incentivizing reasoning from scratch on geospatial scenes, 2025. 9
2025
-
[53]
Pcdasnet: Position-constrained differential attention siamese network for building damage assessment
Jiaqi Wang, Haonan Guo, Xin Su, Li Zheng, and Qiangqiang Yuan. Pcdasnet: Position-constrained differential attention siamese network for building damage assessment. IEEE Transactions on Geoscience and Remote Sensing, 62:1–18,
-
[54]
Learning to compose for cross-domain agentic workflow generation
Jialiang Wang, Shengxiang Xu, Hanmo Liu, Jiachuan Wang, Yuyu Luo, Shimin Di, Min-Ling Zhang, and Lei Chen. Learning to compose for cross-domain agentic workflow generation. arXiv preprint arXiv:2602.11114, 2026. 2
-
[55]
Ringmogpt: A unified re- mote sensing foundation model for vision, language, and grounded tasks
Peijin Wang, Huiyang Hu, Boyuan Tong, Ziqi Zhang, Fan- glong Yao, Yingchao Feng, Zining Zhu, Hao Chang, Wen- hui Diao, Qixiang Ye, et al. Ringmogpt: A unified re- mote sensing foundation model for vision, language, and grounded tasks. IEEE Transactions on Geoscience and Remote Sensing, 2024. 9
2024
-
[56]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
isaid: A large- scale dataset for instance segmentation in aerial images
Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large- scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 28–37,
-
[58]
Vision- language modeling meets remote sensing: Models, datasets, and perspectives
Xingxing Weng, Chao Pang, and Gui-Song Xia. Vision- language modeling meets remote sensing: Models, datasets, and perspectives. IEEE Geoscience and Remote Sensing Magazine, 2025. 1
2025
-
[59]
On the generalization of sft: A reinforcement learning perspective with reward rectification, 2026
Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of sft: A re- inforcement learning perspective with reward rectification. arXiv preprint arXiv:2508.05629, 2025. 4
[60] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. 7
[61] Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017. 6
[62] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3974–3983, 2018.
[63] Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024. 7
[64] Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Yuchen Xiao, Hui Qiao, Weizhan Zhang, Deyu Meng, and Xiangyong Cao. Segearth-r2: Towards comprehensive language-guided segmentation for remote sensing images. arXiv preprint arXiv:2512.20013, 2025. 8
[65] Lingling Xu, Haoran Xie, S Joe Qin, Xiaohui Tao, and Fu Lee Wang. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 2
[66] Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min-Ling Zhang. Robustflow: Towards robust agentic workflow generation. arXiv preprint arXiv:2509.21834, 2025. 2
[67] Wenjia Xu, Zijian Yu, Boyang Mu, Zhiwei Wei, Yuanben Zhang, Guangzuo Li, Jiuniu Wang, and Mugen Peng. RS-Agent: Automating remote sensing tasks through intelligent agent. arXiv preprint arXiv:2406.07089, 2024. 9
[68] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18155–18165, 2022. 8
[69] Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. Remotesam: Towards segment anything for earth observation. arXiv preprint arXiv:2505.18022, 2025. 1, 7
[70] Liang Yao, Fan Liu, Shengxiang Xu, Chuanyi Zhang, Shimin Di, Xing Ma, Jianyu Jiang, Zequan Wang, and Jun Zhou. Uemm-air: Enable uavs to undertake more multi-modal tasks. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12792–12798, 2025. 1
[71] Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11883–11891, 2026. 2, 4, 7, 9
[72] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022. 8
[73] Zhenghang Yuan, Lichao Mou, Yuansheng Hua, and Xiao Xiang Zhu. Rrsis: Referring remote sensing image segmentation. IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024. 8
[74] Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61:1–13, 2023. 6
[75] Yang Zhan, Zhitong Xiong, and Yuan Yuan. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77, 2025. 6, 9
[76] Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmentation. arXiv preprint arXiv:2311.04498, 2023. 8
[77] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5625–5644, 2024. 1
[78] Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, Jun Li, and Xuerui Mao. Earthmarker: A visual prompting multi-modal large language model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 2024. 9
[79] Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024. 9
[80] Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing, 57(8):5535–5548, 2019. 7