pith. machine review for the scientific record.

arxiv: 2604.22855 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Evaluating Remote Sensing Image Captions Beyond Metric Biases

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords remote sensing image captioning · reference-free evaluation · multimodal large language models · zero-shot learning · metric bias · visual reconstruction · training-free method · self-correction

The pith

Unfine-tuned MLLMs surpass fine-tuned models in zero-shot remote sensing image captioning under reference-free evaluation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard evaluation of image captions relies on matching human-written references, which pushes models to copy specific annotation styles instead of describing what they see. This paper asks whether fine-tuning is truly required for remote sensing image captioning, or whether the apparent gap is an artifact of flawed metrics. The authors propose ReconScore, which judges a caption by how well the original image can be reconstructed from the text alone, sidestepping reference biases. Under this metric, powerful unfine-tuned multimodal models outperform their fine-tuned counterparts in zero-shot settings. They also introduce RemoteDescriber, a method that iteratively refines captions using this score without any training.

Core claim

The paper establishes that inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks when evaluated with a reference-free metric based on visual reconstruction capability. This leads to the introduction of RemoteDescriber, a training-free approach that uses the metric for self-correction to achieve state-of-the-art performance on three datasets.

What carries the argument

ReconScore, a reference-free metric that evaluates caption quality by measuring the capability to reconstruct the original visual elements from the generated text.
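
A minimal sketch of how such a reconstruction-based score could be computed. Figure 2 describes the score as a cosine similarity involving the reconstructed image, and Figure 4 compares different image encoders; the sketch below assumes the comparison happens in a frozen image encoder's embedding space, and the `text_to_image` and `image_encoder` callables are illustrative stand-ins rather than the paper's released pipeline.

```python
import torch
import torch.nn.functional as F


def recon_score(caption, original_image, text_to_image, image_encoder):
    """Reference-free caption score: rebuild an image from the caption alone,
    then compare it to the original in a shared embedding space.

    `text_to_image` and `image_encoder` are stand-ins for whatever generator
    and frozen encoder the paper actually uses (Figure 4 compares several
    image encoders); this is a sketch, not the released implementation.
    """
    # 1. Reconstruct the scene from the text alone -- no access to the original image.
    reconstructed = text_to_image(caption)

    # 2. Embed both images with the same frozen encoder.
    with torch.no_grad():
        z_orig = image_encoder(original_image)    # shape: (d,)
        z_recon = image_encoder(reconstructed)    # shape: (d,)

    # 3. The score is the cosine similarity between the two embeddings (cf. Figure 2a).
    return F.cosine_similarity(z_orig, z_recon, dim=-1).item()
```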

If this is right

  • Task-specific fine-tuning is not necessary for high-quality remote sensing image captions when using unbiased evaluation.
  • RemoteDescriber achieves state-of-the-art performance on three datasets without any training.
  • Traditional reference-based metrics have inherent flaws in assessing true semantic quality for remote sensing images.
  • The self-correction mechanism iteratively improves the semantic precision of MLLM outputs without any fine-tuning overhead (a minimal sketch of such a loop follows this list).
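
A minimal sketch of what such a training-free self-correction loop could look like, reusing the `recon_score` sketch above. The `mllm_caption` interface, the feedback prompt, the iteration budget, and the improvement threshold are all hypothetical placeholders; the paper's actual iteration count and stopping criteria are referenced in the rebuttal below but not specified here.

```python
def remote_describer(image, mllm_caption, recon_score_fn,
                     max_iters=5, min_gain=1e-3):
    """Training-free caption refinement driven by a reconstruction score.

    `mllm_caption(image, feedback)` is a hypothetical interface to a frozen
    captioning MLLM, and `recon_score_fn(caption, image)` wraps the ReconScore
    sketch above with its generator and encoder fixed; the iteration budget and
    stopping threshold are illustrative, not values reported in the paper.
    """
    best_caption = mllm_caption(image, feedback=None)
    best_score = recon_score_fn(best_caption, image)

    for _ in range(max_iters):
        # Ask the frozen MLLM to revise its own previous caption; no weights
        # are updated anywhere in this loop.
        feedback = (f"Your previous caption scored {best_score:.3f} on visual "
                    f"reconstruction: \"{best_caption}\". Revise it to describe "
                    "the visible objects and their layout more precisely.")
        candidate = mllm_caption(image, feedback=feedback)
        score = recon_score_fn(candidate, image)

        # Keep the revision only if reconstruction improves; otherwise stop.
        if score - best_score < min_gain:
            break
        best_caption, best_score = candidate, score

    return best_caption
```

The point of the sketch is that the reconstruction score acts purely as an accept-or-reject signal at inference time, so no model weights are touched anywhere in the loop.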

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar reference-free approaches could improve evaluation in other image captioning domains beyond remote sensing.
  • Large pre-trained models may retain more general descriptive power than task-specific fine-tuning allows.
  • Reconstruction-based scoring might be extended to other vision-language tasks for reducing annotation biases.
  • Future inference methods could routinely incorporate self-correction loops using reconstruction scores.

Load-bearing premise

That measuring how well a caption allows reconstruction of the original visual elements provides a bias-free and sufficient assessment of semantic quality for remote sensing images.

What would settle it

An experiment where captions with high ReconScore fail to accurately describe key remote sensing elements like specific object types, spatial relationships, or land cover details that are semantically important.

Figures

Figures reproduced from arXiv: 2604.22855 by Chuanyi Zhang, Fan Liu, Liang Yao, Wei Zhou, Yuye Ma, Ziyun Chen.

Figure 1. Comparison of different MLLMs’ image captioning …
Figure 2. Overview of our method. (a) The ReconScore is computed as the cosine similarity between the reconstructed image …
Figure 3. Visualization results of RemoteDescriber. The bold words represent the key described visual elements in the image.
Figure 4. Comparison of different Image Encoders for Re…
Original abstract

The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript claims that conventional reference-based metrics for remote sensing image captioning (RSIC) introduce biases by forcing models to mimic human annotation styles. It introduces ReconScore, a reference-free metric that evaluates captions based on their ability to reconstruct the original visual elements from text alone. Using this metric, the authors find that unfine-tuned multimodal large language models (MLLMs) outperform fine-tuned ones in zero-shot RSIC. They propose RemoteDescriber, a training-free method that uses ReconScore for iterative self-correction of MLLM outputs, achieving state-of-the-art results on three datasets while also validating the new metric against traditional ones.

Significance. If the central claims hold, this work could significantly impact the evaluation and development of image captioning models in remote sensing by providing a bias-reduced, reference-free alternative to traditional metrics like BLEU or CIDEr. The finding that fine-tuning may not be necessary, and the training-free RemoteDescriber approach, could reduce computational costs and encourage more general-purpose MLLM use in specialized domains. The availability of code is a positive for reproducibility. However, the significance depends on rigorously demonstrating that ReconScore measures semantic quality without introducing reconstruction-specific biases.

major comments (3)
  1. The definition and computation of ReconScore are central to all claims but receive no quantitative details in the abstract (e.g., which reconstruction model is used, whether it is an MLLM or diffusion model, the exact similarity measure between reconstructed and original images, and any ablations showing orthogonality to caption factual accuracy). This is load-bearing because the paper's counterintuitive result on unfine-tuned vs. fine-tuned models rests entirely on ReconScore being bias-free.
  2. The claim that unfine-tuned MLLMs surpass fine-tuned counterparts (abstract) requires explicit support in the results section: which specific MLLMs and fine-tuned variants were compared, the three datasets used, quantitative scores with error bars or statistical tests, and controls showing that reconstruction success correlates with semantic correctness rather than reconstructor priors in RS imagery (where spatial relations and spectral cues are hard to encode in text).
  3. For RemoteDescriber (abstract), the self-correction loop using ReconScore must be detailed with iteration count, stopping criteria, and ablation showing gains over direct zero-shot MLLM outputs; without this, it is unclear whether the SOTA performance stems from the metric or from the base MLLM's generative fluency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each of the major comments below, providing clarifications and indicating where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: The definition and computation of ReconScore are central to all claims but receive no quantitative details in the abstract (e.g., which reconstruction model is used, whether it is an MLLM or diffusion model, the exact similarity measure between reconstructed and original images, and any ablations showing orthogonality to caption factual accuracy). This is load-bearing because the paper's counterintuitive result on unfine-tuned vs. fine-tuned models rests entirely on ReconScore being bias-free.

    Authors: The abstract is designed to be concise, but we agree that additional details on ReconScore would benefit readers. The full manuscript provides these in the methods and experiments sections, including the specific reconstruction model employed, the similarity measure used, and ablations that demonstrate ReconScore's focus on semantic content independent of reconstructor biases. We have revised the abstract to include a summary of these quantitative details to better support the central claims. revision: yes

  2. Referee: The claim that unfine-tuned MLLMs surpass fine-tuned counterparts (abstract) requires explicit support in the results section: which specific MLLMs and fine-tuned variants were compared, the three datasets used, quantitative scores with error bars or statistical tests, and controls showing that reconstruction success correlates with semantic correctness rather than reconstructor priors in RS imagery (where spatial relations and spectral cues are hard to encode in text).

    Authors: The results section explicitly details the comparisons between unfine-tuned MLLMs and fine-tuned models across the three datasets mentioned. Quantitative results are reported in tables with the specific models used. In response to this comment, we have added error bars, performed statistical tests, and included additional controls and analysis to show that reconstruction success correlates with semantic correctness rather than being driven by reconstructor priors in remote sensing imagery. revision: partial

  3. Referee: For RemoteDescriber (abstract), the self-correction loop using ReconScore must be detailed with iteration count, stopping criteria, and ablation showing gains over direct zero-shot MLLM outputs; without this, it is unclear whether the SOTA performance stems from the metric or from the base MLLM's generative fluency.

    Authors: We have expanded the description of RemoteDescriber in the revised manuscript to include the exact iteration count, the stopping criteria based on ReconScore convergence, and ablations that compare the self-correction approach against direct zero-shot MLLM outputs. These additions clarify that the performance improvements are due to the iterative refinement using ReconScore rather than the base model's capabilities alone. revision: yes

Circularity Check

0 steps flagged

ReconScore defined as independent reconstruction proxy; no reduction to inputs by construction

full rationale

The paper introduces ReconScore as a reference-free metric that scores captions according to their ability to enable reconstruction of original visual elements from text alone. This definition is external to the caption-generation process and does not equate the metric to any fitted parameter, self-citation, or prior result within the paper. The central claim that unfine-tuned MLLMs outperform fine-tuned ones follows directly from applying this metric to model outputs; the metric itself is not derived from those outputs. RemoteDescriber's use of ReconScore for iterative self-correction is a downstream application rather than a definitional loop. No equations, uniqueness theorems, or self-citations are shown to collapse the evaluation back onto the inputs. The derivation chain remains self-contained and is grounded in external benchmarks rather than its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities. The reconstruction metric is treated as a novel construct whose internal mechanics are not specified here.

pith-pipeline@v0.9.0 · 5549 in / 1202 out tokens · 47445 ms · 2026-05-10T00:59:20.136341+00:00 · methodology

