pith. machine review for the scientific record.

arxiv: 2605.14475 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords active perception · remote sensing · ultra-high-resolution · vision-language models · planning · evidence tracking · trajectory corpus · GRPO
0 comments

The pith

GeoVista interprets ultra-high-resolution remote sensing images by building a global exploration plan, then performing branch-wise inspections while tracking evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GeoVista to overcome the limits of single-path zooming in large remote sensing scenes, where sparse, tiny evidence is often missed or counted multiple times. It first creates an overall exploration plan across the image, then inspects candidate regions in separate branches while keeping an explicit record of found evidence for aggregation and de-duplication. Training combines the APEX-GRO corpus, which reformulates tasks as Global-Region-Object reasoning in a scale-invariant spatial format, with GRPO alignment of the Observe-Plan-Track loop using rewards for planning, localization, and answer quality. Results on three benchmarks show improved performance over prior methods that lack this structured multi-branch approach.

Core claim

GeoVista is a planning-driven active perception framework that first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection while maintaining an explicit evidence state for cross-region aggregation and de-duplication. It introduces APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. The model uses an Observe-Plan-Track mechanism and aligns with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness.
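
The abstract does not spell out the spatial convention, but the prompt excerpts anchored further down quote bounding boxes on a 0-1000 scale relative to the current view, later restated as global coordinates. A minimal sketch of that mapping, assuming the 0-1000 convention; the function name and exact formula are illustrative and not taken from the released code.

def crop_to_global(bbox_rel, crop_global):
    # Map a box given on a 0-1000 scale relative to a cropped view back to
    # 0-1000 global coordinates. Assumes the 0-1000 relative convention quoted
    # in the appendix prompts; not the authors' implementation.
    gx0, gy0, gx1, gy1 = crop_global      # crop extent in the global frame
    w, h = gx1 - gx0, gy1 - gy0
    x0, y0, x1, y1 = bbox_rel             # box relative to the crop
    return [gx0 + x0 * w / 1000.0,
            gy0 + y0 * h / 1000.0,
            gx0 + x1 * w / 1000.0,
            gy0 + y1 * h / 1000.0]

# Example: a box in the middle of a crop that itself spans the top-left image quadrant.
print(crop_to_global([250, 250, 750, 750], [0, 0, 500, 500]))  # [125.0, 125.0, 375.0, 375.0]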

What carries the argument

The Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, enabled by the APEX-GRO trajectory corpus and GRPO alignment with step-wise rewards.
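
The mechanism is described only at this level of abstraction. A minimal sketch of how the three stages could fit together, with the model and tool interfaces passed in as callables; everything here is an assumption about plausible structure, not the GeoVista implementation.

from typing import Callable, List

Box = List[float]  # [xmin, ymin, xmax, ymax] on a 0-1000 global scale

def observe_plan_track(
    propose_regions: Callable[[], List[Box]],       # global observation -> exploration plan
    inspect_region: Callable[[Box], List[Box]],     # branch-wise inspection -> findings in global coords
    deduplicate: Callable[[List[Box]], List[Box]],  # cross-region aggregation / de-duplication
    answer: Callable[[List[Box]], str],             # final answer from the aggregated evidence state
    max_branches: int = 8,                          # assumed cap; the paper does not state one
) -> str:
    evidence: List[Box] = []                        # explicit evidence state shared across branches
    for region in propose_regions()[:max_branches]: # follow the global plan branch by branch
        evidence.extend(inspect_region(region))
    return answer(deduplicate(evidence))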

If this is right

  • Supports cross-region aggregation of findings while avoiding repeated counts of the same evidence.
  • Enables scale-invariant interactive reasoning from global scene to individual objects in one unified format.
  • Trains models with step-wise rewards that improve planning and localization quality.
  • Delivers state-of-the-art results on RSHR-Bench, XLRS-Bench, and LRS-VQA through structured exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit evidence state could reduce duplication errors when counting small objects scattered over very large areas.
  • Branch-wise inspection after a global plan might generalize to other large-image search tasks such as anomaly detection in aerial surveys.
  • The trajectory corpus approach could support training for similar active perception in domains beyond remote sensing.
  • Parallel branch execution might further improve efficiency once the global plan identifies independent regions.

Load-bearing premise

A global exploration plan followed by branch-wise local inspection with explicit evidence state maintenance will reliably cover sparse tiny evidence across large scenes without losing context or causing duplication.
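
The premise leans on the evidence state merging findings from overlapping branches without double counting. The abstract does not specify the rule; one plausible instantiation is greedy IoU suppression over global boxes, sketched below with an illustrative threshold.

from typing import List

Box = List[float]  # [xmin, ymin, xmax, ymax] in global coordinates

def iou(a: Box, b: Box) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def deduplicate(boxes: List[Box], iou_thresh: float = 0.5) -> List[Box]:
    # Greedy suppression: keep a box only if it does not overlap an already-kept box.
    kept: List[Box] = []
    for box in boxes:
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

# The same vehicle reported from two overlapping branches is counted once.
print(len(deduplicate([[100, 100, 120, 110], [102, 101, 121, 111], [500, 500, 520, 510]])))  # 2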

What would settle it

A direct test showing that single-path sequential exploration, without global planning or evidence tracking, matches or exceeds GeoVista's accuracy on RSHR-Bench, XLRS-Bench, and LRS-VQA would challenge the claimed necessity of the multi-branch approach.

Figures

Figures reproduced from arXiv: 2605.14475 by Bo Yang, Haoran Liu, Jiasen Hu, Jiashun Zhu, Lang Sun, Nachuan Xing, Ronghao Fu, Weijie Zhang, Weipeng Zhang, Xiao Yang, Xu Na, Zhiheng Xue, Zhiwen Lin.

Figure 1
Figure 1. Figure 1: Comparison of active perception paradigms and performance. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance comparison of different models on XLRS-Bench. The blue bars represent the Mean Accuracy, while the red dashed line indicates the Average Turns required for each method. view at source ↗
Figure 3
Figure 3. Figure 3: Data construction pipeline for APEX-GRO. The process consists of four stages: data collection, context construction, multi-turn execution, and quality control. Detailed prompt instructions are provided in Appendix A. view at source ↗
Figure 4
Figure 4. Figure 4: Training pipeline of GeoVista. Stage I performs supervised fine-tuning on APEX-GRO trajectories. Stage II applies GRPO-based alignment on verifiable grounding and counting tasks. The plan reward in the figure is a simplified illustration; the exact state-machine formulation is described in the text. view at source ↗
Figure 5
Figure 5. Figure 5: Performance on UHR datasets. Black values represent our final scores across diverse sub-tasks, and red values highlight absolute improvements over baselines. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics of the GRPO alignment phase. We track the aggregated reward, tool usage, observation length, response length, and downstream XLRS-Bench performance across training. Starting from the APEX-GRO SFT checkpoint, GRPO alignment improves XLRS-Bench performance from 40.53 to 52.78. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The visualization of examples from APEX-GRO dataset. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Execution trajectory for the counting task. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Execution trajectory for the visual grounding task. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Execution trajectory for the route planning task. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Execution trajectory for the spatial relationship task. [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
read the original abstract

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GeoVista, a planning-driven active perception framework for interpreting ultra-high-resolution remote sensing images. It first constructs a global exploration plan, then performs branch-wise local inspection while maintaining an explicit evidence state for cross-region aggregation and de-duplication. This is enabled by the APEX-GRO cold-start trajectory corpus (reformulating tasks as Global-Region-Object reasoning), an Observe-Plan-Track mechanism, and GRPO-based step-wise reward alignment for planning, localization, and answer correctness. The central claim is that this approach achieves state-of-the-art performance on RSHR-Bench, XLRS-Bench, and LRS-VQA.

Significance. If the results hold, the work could meaningfully advance active perception for remote sensing by addressing single-path exploration failures on sparse tiny evidence in large scenes. The APEX-GRO corpus and GRPO alignment provide concrete, reproducible resources that could support further research on scale-invariant spatial reasoning and evidence tracking.

major comments (2)
  1. [Abstract] The assertion of state-of-the-art performance on RSHR-Bench, XLRS-Bench, and LRS-VQA is presented without any quantitative metrics, tables, ablation studies, or error analysis; such evidence is load-bearing for verifying whether the gains arise from the Observe-Plan-Track mechanism rather than model capacity or benchmark artifacts.
  2. [Observe-Plan-Track mechanism] The claim that global planning plus branch-wise inspection with an explicit evidence state reliably avoids context loss, revisits, and missed regions on sparse targets depends on the effectiveness of APEX-GRO and GRPO alignment, yet the manuscript supplies no targeted validation, failure-case analysis, or comparison to single-path baselines that would substantiate this attribution.
minor comments (1)
  1. [Code and dataset availability] The GitHub link for code and dataset is provided, but the manuscript lacks any description of experimental setup details, hyperparameter choices, or reproducibility instructions that would allow independent verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the abstract and provide additional targeted validation for the Observe-Plan-Track mechanism.

read point-by-point responses
  1. Referee: [Abstract] The assertion of state-of-the-art performance on RSHR-Bench, XLRS-Bench, and LRS-VQA is presented without any quantitative metrics, tables, ablation studies, or error analysis; such evidence is load-bearing for verifying whether the gains arise from the Observe-Plan-Track mechanism rather than model capacity or benchmark artifacts.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript contains detailed tables, ablations, and error analysis in Sections 4 and 5, but the abstract currently states the SOTA claim without specific numbers. In the revision we will update the abstract to report the main accuracy gains on each benchmark (e.g., absolute improvements over the strongest baselines) so readers can immediately assess the contribution. revision: yes

  2. Referee: [Observe-Plan-Track mechanism] The claim that global planning plus branch-wise inspection with an explicit evidence state reliably avoids context loss, revisits, and missed regions on sparse targets depends on the effectiveness of APEX-GRO and GRPO alignment, yet the manuscript supplies no targeted validation, failure-case analysis, or comparison to single-path baselines that would substantiate this attribution.

    Authors: We acknowledge the value of more explicit attribution. The current manuscript already reports comparisons against single-path baselines and component ablations in Sections 4.3–4.4. To directly address the request for targeted validation, we will add a dedicated subsection containing (1) quantitative metrics on revisit rate and region coverage, (2) failure-case analysis with qualitative examples of context loss in single-path runs, and (3) side-by-side evidence-tracking visualizations. These additions will make the contribution of the Observe-Plan-Track mechanism clearer. revision: yes
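
The promised revisit-rate and region-coverage metrics are not defined in the rebuttal; a minimal sketch of definitions that would serve, reusing the iou helper from the de-duplication sketch above. Both thresholds are assumptions.

from typing import List

Box = List[float]  # [xmin, ymin, xmax, ymax] in global coordinates

def revisit_rate(visited: List[Box], iou_thresh: float = 0.5) -> float:
    # Fraction of zoom steps whose region substantially overlaps an earlier one
    # (iou as defined in the de-duplication sketch above).
    revisits = sum(
        1 for i, box in enumerate(visited)
        if any(iou(box, prev) >= iou_thresh for prev in visited[:i])
    )
    return revisits / len(visited) if visited else 0.0

def region_coverage(visited: List[Box], targets: List[Box]) -> float:
    # Fraction of ground-truth targets whose area is mostly contained in some visited region.
    def inter_over_target(t: Box, r: Box) -> float:
        ix0, iy0 = max(t[0], r[0]), max(t[1], r[1])
        ix1, iy1 = min(t[2], r[2]), min(t[3], r[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        t_area = (t[2] - t[0]) * (t[3] - t[1])
        return inter / t_area if t_area > 0 else 0.0
    if not targets:
        return 1.0
    return sum(1 for t in targets if any(inter_over_target(t, v) >= 0.5 for v in visited)) / len(targets)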

Circularity Check

0 steps flagged

No significant circularity; new framework components are independently constructed

full rationale

The paper proposes GeoVista as a novel planning-driven active perception framework that first builds a global exploration plan then performs branch-wise local inspection with explicit evidence state maintenance. This is enabled by the newly introduced APEX-GRO cold-start corpus (reformulating tasks as Global-Region-Object reasoning) and an Observe-Plan-Track mechanism aligned via GRPO step-wise rewards. All core claims rest on these new constructions plus empirical SOTA results on RSHR-Bench, XLRS-Bench, and LRS-VQA rather than any derivation, equation, or parameter fit that reduces to the inputs by construction. No self-citations are load-bearing, no uniqueness theorems are imported from the authors' prior work, and no ansatz or known result is smuggled in via citation. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of the newly introduced Observe-Plan-Track mechanism and GRPO training strategy, with the APEX-GRO dataset serving as the key training resource; no explicit free parameters are named but implicit hyperparameters exist in the reward design and trajectory generation.

axioms (1)
  • domain assumption: Vision-language models can be effectively aligned using step-wise rewards for planning, localization, and answer correctness in remote sensing tasks.
    Invoked in the GRPO-based alignment strategy described in the abstract.
invented entities (3)
  • GeoVista framework · no independent evidence
    purpose: Planning-driven active perception for UHR remote sensing understanding
    Core system proposed in the paper.
  • APEX-GRO dataset · no independent evidence
    purpose: Cold-start supervised trajectory corpus reformulating UHR tasks as Global-Region-Object reasoning
    Newly introduced training resource.
  • Observe-Plan-Track mechanism · no independent evidence
    purpose: Global observation, adaptive region inspection, and evidence tracking
    Newly designed process for the framework.
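
The abstract names three step-wise reward components (planning, localization, final-answer correctness) without weights or functional forms, and the ledger above flags the reward design as a source of implicit hyperparameters. A minimal sketch of one plausible weighted combination; the weights and the IoU-based localization term are assumptions, not the reported reward.

from typing import List, Optional

Box = List[float]  # [xmin, ymin, xmax, ymax] in global coordinates

def step_reward(
    plan_valid: bool,                        # did the step follow the expected plan format / state machine?
    pred_box: Optional[Box] = None,          # predicted box at this step, if any
    ref_box: Optional[Box] = None,           # reference box, if one exists for this step
    answer_correct: Optional[bool] = None,   # final-answer correctness, scored on the last step only
    w_plan: float = 0.2, w_loc: float = 0.3, w_ans: float = 0.5,  # assumed weights
) -> float:
    reward = w_plan * float(plan_valid)
    if pred_box is not None and ref_box is not None:
        reward += w_loc * iou(pred_box, ref_box)   # iou as in the de-duplication sketch above
    if answer_correct is not None:
        reward += w_ans * float(answer_correct)
    return reward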

pith-pipeline@v0.9.0 · 5602 in / 1486 out tokens · 41034 ms · 2026-05-15T02:36:57.241165+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 15 internal anchors

  1. [1]

    Towards large-scale small object detection: Survey and benchmarks.IEEE transactions on pattern analysis and machine intelligence, 45(11):13467–13488, 2023

    Gong Cheng, Xiang Yuan, Xiwen Yao, Kebing Yan, Qinghua Zeng, Xingxing Xie, and Junwei Han. Towards large-scale small object detection: Survey and benchmarks.IEEE transactions on pattern analysis and machine intelligence, 45(11):13467–13488, 2023

  2. [2]

    Star: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery.IEEE Trans

    Yansheng Li, Linlin Wang, Tingzhu Wang, Xue Yang, Junwei Luo, Qi Wang, Youming Deng, Wenbin Wang, Xian Sun, Haifeng Li, et al. Star: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery.IEEE Trans. Pattern Anal. Mach. Intell., 47(3):1832–1849, 2025

  3. [3]

    When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning. ArXiv, abs/2503.07588, 2025

    Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, and Yansheng Li. When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning. ArXiv, abs/2503.07588, 2025

  4. [4]

    Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution remote sensing imagery.arXiv preprint arXiv:2602.14201, 2026

    Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, et al. Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution remote sensing imagery.arXiv preprint arXiv:2602.14201, 2026

  5. [5]

    Geollava-8k: Scaling remote-sensing multimodal large language models to 8k resolution

    Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, and Jing Zhang. Geollava-8k: Scaling remote-sensing multimodal large language models to 8k resolution. ArXiv, abs/2505.21375, 2025

  6. [6]

    Zoomearth: Active perception for ultra-high-resolution geospatial vision-language tasks.arXiv preprint arXiv:2511.12267, 2025

    Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, and Xiangyong Cao. Zoomearth: Active perception for ultra-high-resolution geospatial vision-language tasks.arXiv preprint arXiv:2511.12267, 2025

  7. [7]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  8. [8]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  9. [9]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.ArXiv, abs/2304.08485, 2023

  10. [10]

    Improved baselines with visual instruction tuning.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286–26296, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286–26296, 2023

  11. [11]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.ArXiv, abs/2408.03326, 2024

  12. [12]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  13. [13]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Ke-Yang Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution.ArXiv, abs/2409.12191, 2024

  14. [14]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. A...

  15. [15]

    Rsgpt: A remote sensing vision language model and benchmark.ArXiv, abs/2307.15266, 2023

    Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ArXiv, abs/2307.15266, 2023

  16. [16]

    Skyeyegpt: Unifying remote sensing vision- language tasks via instruction tuning with large language model.ArXiv, abs/2401.09712, 2024

    Yangfan Zhan, Zhitong Xiong, and Yuan Yuan. Skyeyegpt: Unifying remote sensing vision- language tasks via instruction tuning with large language model.ArXiv, abs/2401.09712, 2024

  17. [17]

    Geochat: Grounded large vision-language model for remote sensing. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27831–27840, 2023

    Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman H. Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27831–27840, 2023

  18. [18]

    Earthmind: Leveraging cross-sensor data for advanced earth observation interpretation with a unified multimodal llm, 2025

    Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, and Paolo Rota. Earthmind: Leveraging cross-sensor data for advanced earth observation interpretation with a unified multimodal llm, 2025

  19. [19]

    Earthvl: A progressive earth vision-language understanding and generation framework.ArXiv, abs/2601.02783, 2026

    Junjue Wang, Yanfei Zhong, Zihang Chen, Zhuo Zheng, Ailong Ma, and Liangpei Zhang. Earthvl: A progressive earth vision-language understanding and generation framework.ArXiv, abs/2601.02783, 2026

  20. [20]

    Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers

    Zhao yu Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, and Yi R. Fung. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. ArXiv, abs/2506.23918, 2025

  21. [21]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

  22. [22]

    Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

    Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

  23. [23]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  24. [24]

    MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

    Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu, Lifeng Sun, and Minfeng Xu. Medvr: Annotation-free medical visual reasoning via agentic reinforcement learning.arXiv preprint arXiv:2604.08203, 2026

  25. [25]

    Rsvqa: Visual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555– 8566, 2020

    Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555– 8566, 2020

  26. [26]

    Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering

    Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI conference on artificial intelligence, pages 5481–5489, 2024

  27. [27]

    Dota: A large-scale dataset for object detection in aerial images

    Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983, 2018

  28. [28]

    Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022

    Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, et al. Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022

  29. [29]

    Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

    Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

  30. [30]

    Bag-of-visual-words and spatial extensions for land-use classification

    Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, pages 270–279, 2010

  31. [31]

    Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

    Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

  32. [32]

    Satellite image classification via two-layer sparse coding with biased image representation.IEEE Geoscience and remote sensing letters, 8(1):173–176, 2010

    Dengxin Dai and Wen Yang. Satellite image classification via two-layer sparse coding with biased image representation.IEEE Geoscience and remote sensing letters, 8(1):173–176, 2010

  33. [33]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  34. [34]

    Introducing claude 4

    Anthropic. Introducing claude 4. Technical report, Anthropic, 2025

  35. [35]

    Llava-uhd v3: Progressive visual compression for efficient native-resolution encoding in mllms.ArXiv, abs/2511.21150, 2025

    Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen, Yidan Zhang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Llava-uhd v3: Progressive visual compression for efficient native-resolution encoding in mllms.ArXiv, abs/2511.21150, 2025

  36. [36]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision- language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

  37. [37]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  38. [38]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  39. [39]

    Vhm: Versatile and honest vision language model for remote sensing image analysis

    Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. InProceedings of the AAAI Conference on Artificial Intelligence, pages 6381–6388, 2025

  40. [40]

    Earthdial: Turning multi-sensory earth observations to interactive dialogues

    Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shah- baz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

  41. [41]

    When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning

    Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, and Yansheng Li. When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9206–9217, 2025

  42. [42]

    Look where it matters: Training-free ultra-hr remote sensing vqa via adaptive zoom search.ArXiv, abs/2511.20460, 2025

    Yunqi Zhou, Chengjie Jiang, Chun Yuan, and Jing Li. Look where it matters: Training-free ultra-hr remote sensing vqa via adaptive zoom search.ArXiv, abs/2511.20460, 2025

  43. [43]

    Asking like Socrates: Socrates helps VLMs understand remote sensing images

    Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, et al. Asking like socrates: Socrates helps vlms understand remote sensing images.arXiv preprint arXiv:2511.22396, 2025

  44. [44]

    Qwen-agent cookbook: Thinking with images

    Qwen Team. Qwen-agent cookbook: Thinking with images. https://github.com/QwenLM/ Qwen-Agent/blob/main/examples/cookbook_think_with_images.ipynb, 2025. Ac- cessed: 2025-09-23

  45. [45]

    Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery?

    Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336, 2025

  46. [46]

    A benchmark for ultra-high-resolution remote sensing mllms.arXiv preprint arXiv:2512.17319, 2025

    Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, and Yang Gao. A benchmark for ultra-high-resolution remote sensing mllms.arXiv preprint arXiv:2512.17319, 2025

  47. [47]

    Wordnet: a lexical database for english.Communications of the ACM, 38(11):39–41, 1995

    George A Miller. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995

  48. [48]

    Data Collection & Difficulty Assessment The pipeline first collects samples from multi-source remote sensing datasets (e.g., DOTA-v1.5, FAIR1M, EarthVQA), covering diverse tasks such as counting, visual grounding, route planning, classification, and spatial relationships. To ensure data quality, the system introduces a Difficulty Assessment mechanism, cat...

  49. [49]

    This engine fuses three major input streams: • Trajectory Compiler:Converts raw coordinates into a Nested Checklist to provide structured targets

    Context Engine & Constraints Injection After acquiring the foundational data, the context engine is responsible for generating structured prompts. This engine fuses three major input streams: • Trajectory Compiler:Converts raw coordinates into a Nested Checklist to provide structured targets. •Task-Specific SOP Template:Strictly defines the output format ...

  50. [50]

    think-call-observe

    Multi-Turn Execution Loop In the multi-turn reasoning phase, the Teacher Inference model engages in closed-loop interaction based on the system prompt, the initial image, and the dialogue history. In each turn, the model first outputs a <Think> process to plan the current step, followed by initiating a <Tool Call> to invoke the zoom_in tool, passing in th...

  51. [51]

    upper left corner

    Quality Control & Leakage Sanitization To ensure the reliability and interpretability of the final SFT data, the generated trajectories must undergo rigorous quality filtering: •Leakage Sanitization:Rejects samples with Pre-action Coordinate Injection. •Structure Validation:Verifies that the data follows a legal interaction graph structure. 13 • Ground Tr...

  52. [52]

    name": "zoom_in

    zoom_in: { "name": "zoom_in", "arguments": { "source_image_id": "<COPY_EXACT_ID_HERE>", "bbox": [x_min, y_min, x_max, y_max] } } * ID RULE: The source_image_id MUST be copied EXACTLY from theCurrent View: line in the most recent [System Observation]. DO NOT guess! * BBOX RULE: The bbox MUST be strictly RELATIVE to the CURRENT view on a 0-1000 scale. [0, 0...

  53. [53]

    NEVER DO BOTH IN ONE TURN

    ONE STEP AT A TIME: You must either explore using a tool OR output the final answer. NEVER DO BOTH IN ONE TURN

  54. [54]

    Do NOT hallucinate the tool result

    IF CALLING A TOOL: Output your <think> reasoning, then output the <tool_call>, and then STOP IMMEDIATELY . Do NOT hallucinate the tool result. Wait for the [System Observation] to return the cropped image

  55. [55]

    Place your final answer inside the <answer></answer> tags

    FINAL ANSWER: Once your investigation is complete, follow the specific output format required by your current Task (e.g., an integer count, a bounding box, or a classification letter). Place your final answer inside the <answer></answer> tags. Table 5: System Prompt 17 Prompt for Hierarchical Counting Interleaved CoT Generation Task: Global-Context Counti...

  56. [56]

    Use ONLY the initial global view

    Do NOT use ‘zoom_in’. Use ONLY the initial global view

  57. [57]

    Scan carefully across the entire image

  58. [58]

    MANDATORY FORMAT: You MUST explicitly list the accurate global coordinates [xmin, ymin, xmax, ymax] that enclose EVERY target you find inside your <think> block, strictly using this bulleted format: • Obj: [xmin, ymin, xmax, ymax] • Obj: [xmin, ymin, xmax, ymax]

  59. [59]

    Just list the objects

    NO PLANS: Do not write a [PLAN] or [PROGRESS]. Just list the objects

  60. [60]

    Task: Region-Exploration Counting Algorithm SOP (MAXIMUM 1 LAYER):

    ONLY after closing </think>, output the final count in <answer>. Task: Region-Exploration Counting Algorithm SOP (MAXIMUM 1 LAYER):

  61. [61]

    • REASONING: Explicitly explain your zoom strategy

    INITIAL REASONING & PLAN (Turn 1 ONLY): Observe the global view. • REASONING: Explicitly explain your zoom strategy. IF targets are clustered, state that you see specific clusters and will use custom boxes. IF targets are scattered globally, state that they are too dispersed and you will use a 4-quadrant split. • PLAN: After reasoning, write your checklis...

  62. [62]

    • PROGRESS: Update your checklist under [PROGRESS] (mark completed with [x])

    EXECUTION & PROGRESS (Turn 2+): • REASONING: Briefly explain what you are examining in the current cropped view. • PROGRESS: Update your checklist under [PROGRESS] (mark completed with [x]). DO NOT output [PLAN] again

  63. [63]

    LOCAL SUMMARY (SPATIAL FILTERING): Whenever you inspect a cropped image, explicitly list the accurate GLOBAL coordinates formatted strictly as * Obj: [xmin, ymin, xmax, ymax] ONLY for objects that physically fall within the current crop

  64. [64]

    Group and dump ALL the global bounding boxes you calculated

    FINAL AGGREGATION: ONLY when your [PROGRESS] shows all items as [x], you MUST write a [FINAL AGGREGATION] section. Group and dump ALL the global bounding boxes you calculated

  65. [65]

    Task: Object-Targeted Counting Algorithm SOP (MAXIMUM 2 LAYERS):

    Close </think>, then output the total integer count in <answer>. Task: Object-Targeted Counting Algorithm SOP (MAXIMUM 2 LAYERS):

  66. [66]

    Scattered)

    INITIAL REASONING & PLAN (Turn 1 ONLY): • REASONING: Explain your Layer 1 zoom strategy based on target distribution (Clustered vs. Scattered). • PLAN: Write your Layer 1 checklist under [PLAN]. NEVER use [0, 0, 1000, 1000]

  67. [67]

    • PROGRESS: Update your checklist under [PROGRESS]

    EXECUTION & PROGRESS (Turn 2+): • REASONING: Before acting, briefly explain your visual findings in the current crop and whether a Layer 2 zoom is needed. • PROGRESS: Update your checklist under [PROGRESS]. Append Layer 2 zooms as indented nested items if necessary. DO NOT output [PLAN] again

  68. [68]

    LOCAL SUMMARY (SPATIAL FILTERING): For EVERY object found INSIDE THE CURRENT CROP, list the accurate GLOBAL coordinates strictly formatted as * Obj: [xmin, ymin, xmax, ymax]

  69. [69]

    FINAL AGGREGATION: ONLY when your [PROGRESS] is completely exhausted (all items are [x]), you MUST write a [FINAL AGGREGATION] section listing EVERY single global bounding box you discovered, grouped by region

  70. [70]

    Table 6: Prompt for Hierarchical Counting CoT 18 Prompt for Visual Grounding Interleaved CoT Generation Task: Visual Grounding Algorithm SOP (DYNAMIC ZOOMING, MAXIMUM 2 LAYERS):

    Ensure </think> is closed BEFORE outputting <answer>. Table 6: Prompt for Hierarchical Counting CoT 18 Prompt for Visual Grounding Interleaved CoT Generation Task: Visual Grounding Algorithm SOP (DYNAMIC ZOOMING, MAXIMUM 2 LAYERS):

  71. [71]

    Observe the global view to locate the specific target described in the prompt

  72. [72]

    • Targeted Zoom:If the target is small or unclear, execute a precise ‘zoom_in’ on the region containing it

    SMART ZOOM DECISION (Dynamic): •No Zoom:If the target is large and clearly visible in the global view, do NOT zoom. • Targeted Zoom:If the target is small or unclear, execute a precise ‘zoom_in’ on the region containing it. • Recursive Zoom:If it is STILL unclear in the cropped image, execute a secondary ‘zoom_in’ on that crop

  73. [74]

    IMMEDIATE GLOBAL MAPPING: Once you can clearly identify the target, state your finding and immediately output its exact GLOBAL coordinates [xmin, ymin, xmax, ymax] (0-1000 scale) inside your <think> block

  74. [75]

    CONFIDENT FINAL ANSWER: In your final turn, explicitly restate the exact global bounding box inside your <think> block

  75. [76]

    Table 7: Prompt for Visual Grounding CoT Prompt for Classification Interleaved CoT Generation Task: Image/Object Classification Algorithm SOP (DYNAMIC ZOOMING, MAXIMUM 1 LAYER):

    Close </think>, then output ONLY the exact global bounding box array in <answer> (e.g., <answer>[xmin, ymin, xmax, ymax]</answer>). Table 7: Prompt for Visual Grounding CoT Prompt for Classification Interleaved CoT Generation Task: Image/Object Classification Algorithm SOP (DYNAMIC ZOOMING, MAXIMUM 1 LAYER):

  76. [77]

    Observe the given view and read the multiple-choice options provided in the prompt carefully

  77. [78]

    • Targeted Zoom:If the image is large and the specific object mentioned in the prompt is too small to recognize confidently, plan ONE precise ‘ zoom_in’ to confirm its features

    SMART ZOOM DECISION (Dynamic): • No Zoom (Preferred for Global/Low-Res):If the image resolution is low, or the overall scene/object is already clear enough to classify, do NOT use ‘zoom_in’ tools. • Targeted Zoom:If the image is large and the specific object mentioned in the prompt is too small to recognize confidently, plan ONE precise ‘ zoom_in’ to conf...

  78. [79]

    VISUAL FEATURE DEDUCTION: Before concluding, explicitly describe the visual features, textures, colors, or structural layouts you observe that match one of the given categories

  79. [80]

    CONFIDENT CONCLUSION: State which option best aligns with your visual analysis

  80. [81]

    Table 8: Prompt for Classification CoT Prompt for Spatial Relationship Interleaved CoT Generation Task: Spatial Relationship Algorithm SOP (DYNAMIC ZOOMING, MAXIMUM 2 LAYERS):

    FINAL OUTPUT FORMAT: Close </think>, then output ONLY the exact letter corre- sponding to the correct answer inside <answer> (e.g., <answer>A</answer>). Table 8: Prompt for Classification CoT Prompt for Spatial Relationship Interleaved CoT Generation Task: Spatial Relationship Algorithm SOP (DYNAMIC ZOOMING, MAXIMUM 2 LAYERS):

Showing first 80 references.
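
Several internal anchors above quote the trajectory output format: object boxes listed inside the think block as "* Obj: [xmin, ymin, xmax, ymax]" and the final result wrapped in <answer></answer> tags. A minimal parsing sketch assuming those markers appear verbatim in a transcript; the regular expressions are illustrative and not taken from the released code.

import re
from typing import List, Optional

# Accepts both the "* Obj:" and "• Obj:" bullet variants quoted in the prompts.
OBJ_RE = re.compile(r"[*•]\s*Obj:\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_object_boxes(transcript: str) -> List[List[int]]:
    # Collect every "* Obj: [xmin, ymin, xmax, ymax]" entry emitted in think blocks.
    return [[int(v) for v in m.groups()] for m in OBJ_RE.finditer(transcript)]

def parse_final_answer(transcript: str) -> Optional[str]:
    # Return the content of the last <answer></answer> tag, if present.
    matches = ANSWER_RE.findall(transcript)
    return matches[-1].strip() if matches else None

example = """<think>[FINAL AGGREGATION]
* Obj: [120, 340, 150, 360]
* Obj: [500, 210, 530, 240]
</think><answer>2</answer>"""
print(parse_object_boxes(example))   # [[120, 340, 150, 360], [500, 210, 530, 240]]
print(parse_final_answer(example))   # 2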