Evaluating Remote Sensing Image Captions Beyond Metric Biases
Pith reviewed 2026-05-10 00:59 UTC · model grok-4.3
The pith
Unfine-tuned MLLMs surpass fine-tuned models in zero-shot remote sensing image captioning under reference-free evaluation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks when evaluated with a reference-free metric based on visual reconstruction capability. This leads to the introduction of RemoteDescriber, a training-free approach that uses the metric for self-correction to achieve state-of-the-art performance on three datasets.
What carries the argument
ReconScore, a reference-free metric that evaluates caption quality by measuring the capability to reconstruct the original visual elements from the generated text.
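The metric's core computation can be sketched as follows. The review does not specify the paper's reconstruction model or similarity measure, so `reconstruct` and `embed` are placeholders and cosine similarity is an assumption, not the paper's exact formula:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recon_score(original_embedding, caption, reconstruct, embed):
    """Score a caption by how well an image reconstructed from the
    caption ALONE matches the original image in embedding space.

    `reconstruct` stands in for a text-to-image model and `embed`
    for a visual encoder; both are hypothetical placeholders.
    No reference captions are consulted at any point.
    """
    reconstructed = reconstruct(caption)  # image generated from text only
    return cosine_similarity(original_embedding, embed(reconstructed))
```

Because the score depends only on the original image and the generated text, annotation style plays no role; what matters is whether the caption preserves enough visual content to rebuild the scene.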
If this is right
- Task-specific fine-tuning is not necessary for high-quality remote sensing image captions when using unbiased evaluation.
- RemoteDescriber achieves state-of-the-art performance on three datasets without any training.
- Traditional reference-based metrics have inherent flaws in assessing true semantic quality for remote sensing images.
- The self-correction mechanism iteratively improves the semantic precision of MLLM outputs without any fine-tuning overhead.
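The style bias that reference-based metrics introduce can be illustrated with a toy unigram-precision score (a deliberately simplified stand-in for BLEU-1, not the paper's evaluation; the example captions are invented):

```python
def unigram_precision(candidate, reference):
    """Toy BLEU-1-style precision: fraction of candidate words that
    also appear in the reference. A crude stand-in used only to show
    how word-overlap metrics reward annotation style."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(w in ref for w in cand) / len(cand)

reference       = "many planes are parked at the airport"
stylistic_match = "several planes are parked at the airport"
semantic_match  = "an airfield with numerous aircraft on the tarmac"

# The caption phrased like the annotation scores ~0.86; an equally
# accurate caption in different words scores ~0.13.
```

Both candidates describe the same scene, yet the overlap metric ranks them far apart, which is exactly the failure mode a reconstruction-based score is meant to avoid.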
Where Pith is reading between the lines
- Similar reference-free approaches could improve evaluation in other image captioning domains beyond remote sensing.
- Large pre-trained models may retain more general descriptive power than task-specific fine-tuning allows.
- Reconstruction-based scoring might be extended to other vision-language tasks for reducing annotation biases.
- Future inference methods could routinely incorporate self-correction loops using reconstruction scores.
Load-bearing premise
That measuring how well a caption allows reconstruction of the original visual elements provides a bias-free and sufficient assessment of semantic quality for remote sensing images.
What would settle it
An experiment where captions with high ReconScore fail to accurately describe key remote sensing elements like specific object types, spatial relationships, or land cover details that are semantically important.
Original abstract
The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that conventional reference-based metrics for remote sensing image captioning (RSIC) introduce biases by forcing models to mimic human annotation styles. It introduces ReconScore, a reference-free metric that evaluates captions based on their ability to reconstruct the original visual elements from text alone. Using this metric, the authors find that unfine-tuned multimodal large language models (MLLMs) outperform fine-tuned ones in zero-shot RSIC. They propose RemoteDescriber, a training-free method that uses ReconScore for iterative self-correction of MLLM outputs, achieving state-of-the-art results on three datasets while also validating the new metric against traditional ones.
Significance. If the central claims hold, this work could significantly impact the evaluation and development of image captioning models in remote sensing by providing a bias-reduced, reference-free alternative to traditional metrics like BLEU or CIDEr. The finding that fine-tuning may not be necessary, and the training-free RemoteDescriber approach, could reduce computational costs and encourage more general-purpose MLLM use in specialized domains. The availability of code is a positive for reproducibility. However, the significance depends on rigorously demonstrating that ReconScore measures semantic quality without introducing reconstruction-specific biases.
Major comments (3)
- The definition and computation of ReconScore are central to all claims but receive no quantitative details in the abstract (e.g., which reconstruction model is used, whether it is an MLLM or diffusion model, the exact similarity measure between reconstructed and original images, and any ablations showing orthogonality to caption factual accuracy). This is load-bearing because the paper's counterintuitive result on unfine-tuned vs. fine-tuned models rests entirely on ReconScore being bias-free.
- The claim that unfine-tuned MLLMs surpass fine-tuned counterparts (abstract) requires explicit support in the results section: which specific MLLMs and fine-tuned variants were compared, the three datasets used, quantitative scores with error bars or statistical tests, and controls showing that reconstruction success correlates with semantic correctness rather than reconstructor priors in RS imagery (where spatial relations and spectral cues are hard to encode in text).
- For RemoteDescriber (abstract), the self-correction loop using ReconScore must be detailed with iteration count, stopping criteria, and ablation showing gains over direct zero-shot MLLM outputs; without this, it is unclear whether the SOTA performance stems from the metric or from the base MLLM's generative fluency.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each of the major comments below, providing clarifications and indicating where revisions have been made to the manuscript.
Point-by-point responses
Referee: The definition and computation of ReconScore are central to all claims but receive no quantitative details in the abstract (e.g., which reconstruction model is used, whether it is an MLLM or diffusion model, the exact similarity measure between reconstructed and original images, and any ablations showing orthogonality to caption factual accuracy). This is load-bearing because the paper's counterintuitive result on unfine-tuned vs. fine-tuned models rests entirely on ReconScore being bias-free.
Authors: The abstract is designed to be concise, but we agree that additional details on ReconScore would benefit readers. The full manuscript provides these in the methods and experiments sections, including the specific reconstruction model employed, the similarity measure used, and ablations that demonstrate ReconScore's focus on semantic content independent of reconstructor biases. We have revised the abstract to include a summary of these quantitative details to better support the central claims. revision: yes
Referee: The claim that unfine-tuned MLLMs surpass fine-tuned counterparts (abstract) requires explicit support in the results section: which specific MLLMs and fine-tuned variants were compared, the three datasets used, quantitative scores with error bars or statistical tests, and controls showing that reconstruction success correlates with semantic correctness rather than reconstructor priors in RS imagery (where spatial relations and spectral cues are hard to encode in text).
Authors: The results section explicitly details the comparisons between unfine-tuned MLLMs and fine-tuned models across the three datasets mentioned. Quantitative results are reported in tables with the specific models used. In response to this comment, we have added error bars, performed statistical tests, and included additional controls and analysis to show that reconstruction success correlates with semantic correctness rather than being driven by reconstructor priors in remote sensing imagery. revision: partial
Referee: For RemoteDescriber (abstract), the self-correction loop using ReconScore must be detailed with iteration count, stopping criteria, and ablation showing gains over direct zero-shot MLLM outputs; without this, it is unclear whether the SOTA performance stems from the metric or from the base MLLM's generative fluency.
Authors: We have expanded the description of RemoteDescriber in the revised manuscript to include the exact iteration count, the stopping criteria based on ReconScore convergence, and ablations that compare the self-correction approach against direct zero-shot MLLM outputs. These additions clarify that the performance improvements are due to the iterative refinement using ReconScore rather than the base model's capabilities alone. revision: yes
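The self-correction loop described in this exchange can be sketched as a minimal outline. The `generate_caption`, `revise_caption`, and `score` callables are hypothetical stand-ins for the base MLLM, an MLLM revision prompt, and ReconScore; the iteration limit and convergence tolerance are illustrative, not the paper's reported settings:

```python
def remote_describer(image, generate_caption, revise_caption, score,
                     max_iters=5, tol=1e-3):
    """Training-free iterative refinement: propose a revision, keep it
    only if the reconstruction score improves, and stop once the gain
    falls within `tol` or `max_iters` is reached."""
    caption = generate_caption(image)
    best = score(image, caption)
    for _ in range(max_iters):
        candidate = revise_caption(image, caption)
        s = score(image, candidate)
        if s <= best + tol:   # converged: revision no longer helps
            break
        caption, best = candidate, s
    return caption, best
```

An ablation against the direct zero-shot output then amounts to comparing `generate_caption(image)` with the loop's final caption under the same scorer, which is the comparison the referee asks for.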
Circularity Check
ReconScore defined as independent reconstruction proxy; no reduction to inputs by construction
Full rationale
The paper introduces ReconScore as a reference-free metric that scores captions according to their ability to enable reconstruction of original visual elements from text alone. This definition is external to the caption-generation process and does not equate the metric to any fitted parameter, self-citation, or prior result within the paper. The central claim that unfine-tuned MLLMs outperform fine-tuned ones follows directly from applying this metric to model outputs; the metric itself is not derived from those outputs. RemoteDescriber's use of ReconScore for iterative self-correction is a downstream application rather than a definitional loop. No equations, uniqueness theorems, or self-citations are shown to collapse the evaluation back onto the inputs. The derivation chain remains self-contained against external benchmarks.