pith. machine review for the scientific record.

arxiv: 2605.10576 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

Chen Zhong, Guangyi Yang, Jiaxing Sun, Wei He, Xiao An, Zihan Gui


Pith reviewed 2026-05-12 03:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote sensing · vision-language models · image quality assessment · low-level perception · benchmark · degradations · VLMs · domain bias

The pith

SenseBench shows that vision-language models carry strong biases toward everyday photos and fail to reliably perceive or describe degradations in remote sensing images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SenseBench, a benchmark built from a physics-based taxonomy of remote sensing degradations, to test whether large vision-language models can move beyond their training on ground-level photos and handle the specific artifacts that appear in satellite and aerial imagery. Over 10,000 curated examples span six major and twenty-two fine-grained degradation types, with two evaluation tracks: objective tasks that measure whether models can detect and localize low-level issues, and subjective tasks that ask for diagnostic language descriptions. Tests on twenty-nine current VLMs uncover consistent patterns of domain skew, collapse when multiple distortions are present, fluent-sounding but inaccurate descriptions, and an inversion in which perception accuracy and description quality trade off against each other. These findings matter because remote-sensing analysts need interpretable, language-based quality judgments rather than single numeric scores, and the benchmark supplies both the data and the protocols to measure progress toward that goal.
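To make the two-track design concrete, here is a hedged sketch of what such an evaluation loop could look like. Everything in it is an illustrative assumption: the item fields, the `ask_vlm` and `judge` callables, and the use of a judge model for the description track (a pattern the reference list gestures at via the LLM-as-a-judge literature) are stand-ins, not SenseBench's actual protocol or API.

```python
# Illustrative sketch of a two-track harness in the spirit of SenseBench's
# objective-perception / subjective-description split. `ask_vlm` and `judge`
# are hypothetical callables, not the benchmark's API.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MCQItem:
    """Objective track: closed-form perception question."""
    image: str
    question: str
    answer: str  # e.g., the distortion type expected in the reply

@dataclass
class DescItem:
    """Subjective track: open-ended diagnostic description."""
    image: str
    prompt: str
    reference: str  # reference description to score against

def evaluate(
    ask_vlm: Callable[[str, str], str],  # (image, prompt) -> model text
    judge: Callable[[str, str], float],  # (candidate, reference) -> score
    objective: Sequence[MCQItem],
    subjective: Sequence[DescItem],
) -> tuple[float, float]:
    # Track 1: exact-match accuracy on perception questions.
    perception_acc = sum(
        ask_vlm(it.image, it.question).strip().lower() == it.answer.lower()
        for it in objective
    ) / len(objective)
    # Track 2: mean judge score over diagnostic descriptions.
    description_q = sum(
        judge(ask_vlm(it.image, it.prompt), it.reference) for it in subjective
    ) / len(subjective)
    # Keeping the two numbers separate is what makes a
    # perception-description inversion observable at all.
    return perception_acc, description_q
```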

Core claim

Comprehensive evaluation of 29 state-of-the-art VLMs on SenseBench reveals that these models exhibit skewed domain priors favoring natural ground-level images, multi-distortion collapse, fluency illusion in generated descriptions, and a perception-description inversion effect when applied to remote sensing degradations.

What carries the argument

SenseBench itself: a physics-based hierarchical taxonomy that unifies non-reference and reference-based image-quality paradigms, more than 10,000 curated instances spanning 6 major and 22 fine-grained remote-sensing degradation categories, and two complementary protocols for objective perception and subjective diagnostic description.
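To fix ideas, a minimal sketch of the taxonomy's shape and the per-instance metadata. The six L-1 dimensions (Noise, Blur, Cloud, Compression, Correction, Missing) and the metadata fields (Image Address, L1_distortion, L2_distortion, Severity Level) come from the paper's figures; the L-2 entries below are illustrative placeholders rather than the paper's actual 22 categories.

```python
# Sketch of SenseBench's two-level degradation taxonomy and instance
# metadata. L-1 keys follow the paper's figures; most L-2 values are
# assumed placeholders, not the paper's 22 actual categories.
from dataclasses import dataclass

TAXONOMY: dict[str, list[str]] = {
    "Image Noise":       ["sensor noise", "gaussian noise"],  # partly assumed
    "Image Blur":        ["gaussian blur", "motion blur"],
    "Image Cloud":       ["haze", "thick cloud"],             # partly assumed
    "Image Compression": ["compression artifacts"],
    "Image Correction":  ["correction artifacts"],            # assumed
    "Image Missing":     ["missing regions"],                 # assumed
}

@dataclass
class Instance:
    """One benchmark item, mirroring the metadata file fields named in
    the paper's Figure 12 (Image Address, L1/L2 distortion, severity)."""
    image_address: str
    l1_distortion: str
    l2_distortion: str
    severity_level: int

def check(inst: Instance) -> None:
    # Reject metadata that falls outside the declared taxonomy.
    assert inst.l1_distortion in TAXONOMY, inst.l1_distortion
    assert inst.l2_distortion in TAXONOMY[inst.l1_distortion], inst.l2_distortion
```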

If this is right

  • VLMs will need explicit domain adaptation or remote-sensing-specific pretraining before they can serve as reliable interpreters of image quality in satellite or aerial data.
  • Language-based diagnostic descriptions from VLMs could replace or augment scalar IQA scores only after the observed fluency illusion and inversion effects are mitigated.
  • Multi-distortion scenarios common in real remote-sensing pipelines will continue to expose weaknesses in current models until the benchmark data are used for targeted training.
  • The two-protocol design (objective perception plus subjective description) provides a template for future benchmarks that want to separate detection accuracy from explanatory fluency; a minimal sketch of how that separation exposes the inversion effect follows this list.
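The inversion test itself can be stated in a few lines: give each model one perception-accuracy number and one description-quality number, then check whether the two rankings anti-correlate across models. The sketch below uses invented scores, not results from the paper.

```python
# Toy sketch: a perception-description inversion shows up as a negative
# Spearman rank correlation across models. All scores here are invented.
from scipy.stats import spearmanr

perception_acc = {"m1": 0.72, "m2": 0.66, "m3": 0.61, "m4": 0.55, "m5": 0.49}
description_q  = {"m1": 2.8,  "m2": 3.1,  "m3": 3.3,  "m4": 3.6,  "m5": 3.9}

models = sorted(perception_acc)
rho, pval = spearmanr(
    [perception_acc[m] for m in models],
    [description_q[m] for m in models],
)
# rho near -1 (this toy data gives exactly -1.0): models that perceive
# degradations better describe them worse.
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```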

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy-driven construction method could be applied to create parallel benchmarks for medical, underwater, or astronomical imagery where VLMs also encounter domain gaps.
  • Fine-tuning VLMs on the SenseBench training split might reduce the reported inversion effect and improve both perception and description scores simultaneously.
  • If the inversion effect persists across domains, it would suggest a fundamental architectural tension in current VLMs between low-level feature extraction and high-level language generation.

Load-bearing premise

The physics-based hierarchical taxonomy and the 10K curated instances accurately and representatively capture real-world remote sensing degradations without selection bias or annotation errors.

What would settle it

Re-running the same 29 VLMs on an independently assembled collection of remote-sensing images that matches the degradation taxonomy but uses different sensors and scenes, and finding no statistically significant domain skew, multi-distortion collapse, fluency illusion, or perception-description inversion.
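"Statistically significant" needs an operationalization the page does not supply. One hedged possibility is a permutation test on per-item correctness for the same model across the two domains; the data below are synthetic and the protocol is an assumption, not the paper's analysis.

```python
# Hedged sketch: permutation test for domain skew, comparing a model's
# per-item correctness on natural-photo items vs. remote-sensing items.
import numpy as np

rng = np.random.default_rng(0)

def domain_skew_pvalue(natural: np.ndarray, remote: np.ndarray,
                       n_perm: int = 10_000) -> float:
    """Two-sided permutation test on the accuracy gap between domains.
    Inputs are 0/1 arrays of per-item correctness."""
    observed = natural.mean() - remote.mean()
    pooled = np.concatenate([natural, remote])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = pooled[: len(natural)].mean() - pooled[len(natural):].mean()
        if abs(gap) >= abs(observed):
            hits += 1
    return hits / n_perm

# Synthetic illustration: 75% accuracy on natural photos, 55% on RS imagery.
nat = rng.binomial(1, 0.75, size=400)
rs  = rng.binomial(1, 0.55, size=400)
print(f"p = {domain_skew_pvalue(nat, rs):.4f}")  # small p => significant skew
```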

Figures

Figures reproduced from arXiv: 2605.10576 by Chen Zhong, Guangyi Yang, Jiaxing Sun, Wei He, Xiao An, Zihan Gui.

Figure 1. Overview of the dataset coverage and degradation taxonomy.
Figure 2. Overview of the SenseBench evaluation framework.
Figure 3. Overview of SenseBench construction.
Figure 1. To structurally prevent inference-time leakage, high-quality references are strictly withheld.
Figure 4. Cross-perspective analysis of VLMs on SenseBench.
Figure 5. Examples of L-2 distortions within the L-1 dimension of Image Noise.
Figure 6. Examples of L-2 distortions within the L-1 dimension of Image Blur.
Figure 7. Examples of L-2 distortions within the L-1 dimension of Image Cloud.
Figure 8. Examples of L-2 distortions within the L-1 dimension of Image Compression.
Figure 9. Examples of L-2 distortions within the L-1 dimension of Image Correction.
Figure 10. Examples of L-2 distortions within the L-1 dimension of Image Missing.
Figure 11. Details of the construction of SensePerception. Source patches were collected globally (including from Google Earth) and cropped or resized to a unified 512 × 512, yielding 26,493 RS image patches as the initial pool for filtering and benchmark construction.
Figure 12. Details of the construction of SenseDescription. Distortions were synthesized with predefined parameters so that category and severity are explicitly recorded; each distorted image is paired with a metadata file listing Image Address, L1_distortion, L2_distortion, and Severity Level.
Figure 13. SenseBench annotation system used for human evaluator baseline collection and evaluation.
read the original abstract

Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces SenseBench, the first dedicated benchmark for remote sensing low-level visual perception and description in VLMs. It is driven by a physics-based hierarchical taxonomy unifying non-reference and reference-based paradigms, featuring over 10K curated instances across 6 major and 22 fine-grained RS degradation categories. Two complementary protocols evaluate objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals skewed domain priors, multi-distortion collapse, fluency illusion, and a perception-description inversion effect. Code and datasets are released publicly.

Significance. If the benchmark validity holds, this work would be significant for addressing the domain gap in VLMs for remote sensing applications, where standard IQA methods fall short of expert diagnostic needs. It supplies a reproducible testbed and high-quality data to diagnose specific VLM failure modes. The public release of code and datasets on GitHub is a clear strength that supports reproducibility and further research in the field.

major comments (2)
  1. [Abstract] The central claims rest on evaluation outcomes for 29 VLMs showing domain priors, multi-distortion collapse, fluency illusion, and perception-description inversion, yet the abstract supplies no information on curation methodology, inter-annotator agreement, statistical tests, or controls for these effects; this leaves the results dependent on unverified assertions about the 10K instances.
  2. [Benchmark construction] The physics-based hierarchical taxonomy and 10K instances are presented as accurately capturing real-world RS degradations across categories, but no quantitative evidence (e.g., inter-annotator agreement, comparison to real satellite sensor logs, or external validation) is provided to rule out selection bias or annotation artifacts that could produce benchmark-specific rather than general effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comments point by point and commit to revisions that improve the clarity and validation of our benchmark.

read point-by-point responses
  1. Referee: [Abstract] The central claims rest on evaluation outcomes for 29 VLMs showing domain priors, multi-distortion collapse, fluency illusion, and perception-description inversion, yet the abstract supplies no information on curation methodology, inter-annotator agreement, statistical tests, or controls for these effects; this leaves the results dependent on unverified assertions about the 10K instances.

    Authors: The abstract is intentionally concise to highlight the key contributions and findings. Detailed information on the curation methodology, including the physics-based taxonomy, instance selection criteria, inter-annotator agreement for the diagnostic descriptions, and statistical tests used in the VLM evaluations, is provided in the Methods and Experiments sections of the manuscript. We will revise the abstract to include a short statement on the curation process and validation to make the claims more self-contained. revision: yes

  2. Referee: [Benchmark construction] The physics-based hierarchical taxonomy and 10K instances are presented as accurately capturing real-world RS degradations across categories, but no quantitative evidence (e.g., inter-annotator agreement, comparison to real satellite sensor logs, or external validation) is provided to rule out selection bias or annotation artifacts that could produce benchmark-specific rather than general effects.

    Authors: We thank the referee for this observation. The taxonomy is constructed based on physical models of RS image formation and degradation processes documented in the remote sensing literature. The 10K instances were curated through a multi-stage process involving expert annotation. To strengthen the manuscript, we will add quantitative evidence including inter-annotator agreement metrics for the annotations and additional external validation steps. Direct comparison to proprietary satellite sensor logs is challenging due to data access restrictions, but we will incorporate comparisons to publicly available RS degradation datasets and expert surveys to mitigate concerns about selection bias. revision: partial
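For concreteness, one standard form the promised agreement metric could take is Cohen's kappa over categorical distortion labels from two annotators; the labels below are toy data, not SenseBench annotations.

```python
# Minimal sketch of an inter-annotator agreement metric: Cohen's kappa
# over categorical L-1 distortion labels. Toy labels only.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["blur", "noise", "cloud", "blur", "missing", "noise"]
annotator_2 = ["blur", "noise", "cloud", "noise", "missing", "noise"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance
```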

Circularity Check

0 steps flagged

No circularity; empirical benchmark with direct evaluation

full rationale

The paper constructs SenseBench via a physics-based taxonomy and curation of 10K instances, then reports direct empirical results from evaluating 29 VLMs on perception and description protocols. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations appear in the provided text. Claims about domain priors, collapse, fluency illusion, and inversion effects are observations from the evaluation rather than reductions to inputs by construction, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on a newly introduced physics-based hierarchical taxonomy and a large curated dataset; no free parameters are fitted, no new physical entities are postulated, and the only background assumption is that the taxonomy faithfully reflects RS degradation physics.

axioms (1)
  • domain assumption: A physics-based hierarchical taxonomy unifies non-reference and reference-based RS degradation paradigms
    Invoked to organize the 6 major and 22 fine-grained categories; assumed to be accurate without further validation details in the abstract.

pith-pipeline@v0.9.0 · 5587 in / 1290 out tokens · 102409 ms · 2026-05-12T03:55:56.467447+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 8 internal anchors

  1. [1]

    Vision foundation models in remote sensing: A survey.IEEE Geoscience and Remote Sensing Magazine, 13(3):190–215, 2025

    Siqi Lu, Junlin Guo, James R Zimmer-Dauphinee, Jordan M Nieusma, Xiao Wang, Parker VanValkenburgh, Steven A Wernke, and Yuankai Huo. Vision foundation models in remote sensing: A survey.IEEE Geoscience and Remote Sensing Magazine, 13(3):190–215, 2025

  2. [2]

    Deep learning in remote sensing: A comprehensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36, 2017

    Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A comprehensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36, 2017

  3. [3]

    Perspectives in machine learning for wildlife conservation.Nature communications, 13(1):792, 2022

    Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis, Mackenzie W Mathis, Frank Van Langevelde, Tilo Burghardt, et al. Perspectives in machine learning for wildlife conservation.Nature communications, 13(1):792, 2022

  4. [4]

    Bpme: A blind perceptual metric-driven method for enhancing degraded polar optical remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 64:1–13, 2025

    Zijun Wei, Chaozhen Lan, Fushan Yao, Longhao Wang, Tian Gao, and Hanyang Yu. Bpme: A blind perceptual metric-driven method for enhancing degraded polar optical remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 64:1–13, 2025

  5. [5]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

  6. [6]

    Blind image quality assessment via vision-language correspondence: A multitask learning perspective

    Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14071–14081, 2023

  7. [7]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

  8. [8]

    Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

    Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

  9. [9]

    Q-instruct: Improving low-level visual abilities for multi-modality foundation models

    Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, et al. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25490–25500, 2024

  10. [10]

    Depicting beyond scores: Advancing image quality assessment through multi-modal language models

    Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276. Springer, 2024

  11. [11]

    Pretrain a remote sensing foundation model by promoting intra-instance similarity.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

    Xiao An, Wei He, Jiaqi Zou, Guangyi Yang, and Hongyan Zhang. Pretrain a remote sensing foundation model by promoting intra-instance similarity.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

  12. [12]

    Task-guided prompting for unified remote sensing image restoration.IEEE Transactions on Geoscience and Remote Sensing, 64:1–17, 2025

    Wenli Huang, Yang Wu, Xiaomeng Xin, Zhihong Liu, Jinjun Wang, and Ye Deng. Task-guided prompting for unified remote sensing image restoration.IEEE Transactions on Geoscience and Remote Sensing, 64:1–17, 2025

  13. [13]

    Reobench: Benchmarking robustness of earth observation foundation models.arXiv preprint arXiv:2505.16793, 2025

    Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, and Tianjin Huang. Reobench: Benchmarking robustness of earth observation foundation models.arXiv preprint arXiv:2505.16793, 2025

  14. [14]

    Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. Advances in Neural Information Processing Systems, 37:3229–3242, 2024

    Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. Advances in Neural Information Processing Systems, 37:3229–3242, 2024

  15. [15]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  16. [16]

    Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  17. [17]

    Q-bench: A benchmark for multi-modal foundation models on low-level vision from single images to pairs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, and Weisi Lin. Q-bench: A benchmark for multi-modal foundation models on low-level vision from single images to pairs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  18. [18]

    Enhancing descriptive image quality assessment with a large-scale multi-modal dataset.IEEE Transactions on Image Processing, 34:8201–8215, 2025

    Zhiyuan You, Jinjin Gu, Xin Cai, Zheyuan Li, Kaiwen Zhu, Chao Dong, and Tianfan Xue. Enhancing descriptive image quality assessment with a large-scale multi-modal dataset.IEEE Transactions on Image Processing, 34:8201–8215, 2025

  19. [19]

    Towards open-ended visual quality comparison

    Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, et al. Towards open-ended visual quality comparison. InEuropean Conference on Computer Vision, pages 360–377. Springer, 2024

  20. [20]

    Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

    Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024

  21. [21]

    Geochat: Grounded large vision-language model for remote sensing

    Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831–27840, 2024

  22. [22]

    Choice: Benchmarking the remote sensing capabilities of large vision-language models.arXiv preprint arXiv:2411.18145, 2024

    Xiao An, Jiaxing Sun, Zihan Gui, and Wei He. Choice: Benchmarking the remote sensing capabilities of large vision-language models.arXiv preprint arXiv:2411.18145, 2024

  23. [23]

    Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response.arXiv preprint arXiv:2505.21089, 2025

    Junjue Wang, Weihao Xuan, Heli Qi, Zhihao Liu, Kunyi Liu, Yuhan Wu, Hongruixuan Chen, Jian Song, Junshi Xia, Zhuo Zheng, et al. Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response.arXiv preprint arXiv:2505.21089, 2025

  24. [24]

    Dynamicvl: Benchmarking multimodal large language models for dynamic city understanding.arXiv preprint arXiv:2505.21076, 2025

    Weihao Xuan, Junjue Wang, Heli Qi, Zihang Chen, Zhuo Zheng, Yanfei Zhong, Junshi Xia, and Naoto Yokoya. Dynamicvl: Benchmarking multimodal large language models for dynamic city understanding.arXiv preprint arXiv:2505.21076, 2025

  25. [25]

    No-reference image quality assessment in the spatial domain.IEEE Transactions on image processing, 21(12):4695–4708, 2012

    Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain.IEEE Transactions on image processing, 21(12):4695–4708, 2012

  26. [26]

    Making a “completely blind” image quality analyzer

    Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012

  27. [27]

    No-reference remote sensing image quality assessment based on gradient-weighted natural scene statistics in spatial domain.Journal of Electronic Imaging, 28(1):013033–013033, 2019

    Junhua Yan, Xuehan Bai, Yongqi Xiao, Yin Zhang, and Xiangyang Lv. No-reference remote sensing image quality assessment based on gradient-weighted natural scene statistics in spatial domain.Journal of Electronic Imaging, 28(1):013033–013033, 2019

  28. [28]

    Bm-iqe: An image quality evaluator with block-matching for both real-life scenes and remote sensing scenes.Sensors, 20 (12):3472, 2020

    Ningshan Xu, Dongao Ma, Guoqiang Ren, and Yongmei Huang. Bm-iqe: An image quality evaluator with block-matching for both real-life scenes and remote sensing scenes.Sensors, 20 (12):3472, 2020

  29. [29]

    A large-scale remote sensing database for subjective and objective quality assessment of pansharpened images

    Yiming Xiong, Feng Shao, Xiangchao Meng, Qiuping Jiang, Weiwei Sun, Randi Fu, and Yo-Sung Ho. A large-scale remote sensing database for subjective and objective quality assessment of pansharpened images. Journal of Visual Communication and Image Representation, 73:102947, 2020

  30. [30]

    Nr-iqa for uav hyperspectral image based on distortion constructing, feature screening, and machine learning

    Wenzhong Tian, Arturo Sanchez-Azofeifa, Za Kan, Qingzhan Zhao, Guoshun Zhang, Yuzhen Wu, and Kai Jiang. Nr-iqa for uav hyperspectral image based on distortion constructing, feature screening, and machine learning. International Journal of Applied Earth Observation and Geoinformation, 133:104130, 2024

  31. [31]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  32. [32]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  33. [33]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  34. [34]

    Q-align: Teaching lmms for visual scoring via discrete text-defined levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023

  35. [35]

    Revisiting mllm based image quality assessment: Errors and remedy

    Zhenchen Tang, Songlin Yang, Bo Peng, Zichuan Wang, and Jing Dong. Revisiting mllm based image quality assessment: Errors and remedy. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9475–9483, 2026

  36. [36]

    Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

    Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

  37. [37]

    Earthdial: Turning multi-sensory earth observations to interactive dialogues

    Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

  38. [38]

    Falcon: A remote sensing vision-language foundation model.arXiv preprint arXiv:2503.11070, 2025

    Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, et al. Falcon: A remote sensing vision-language foundation model.arXiv preprint arXiv:2503.11070, 2025

  39. [40]

    Rs-moe: A vision-language model with mixture of experts for remote sensing image captioning and visual question answering.IEEE Transactions on Geoscience and Remote Sensing, 2025

    Hui Lin, Danfeng Hong, Shuhang Ge, Chuyao Luo, Kai Jiang, Hao Jin, and Congcong Wen. Rs-moe: A vision-language model with mixture of experts for remote sensing image captioning and visual question answering.IEEE Transactions on Geoscience and Remote Sensing, 2025

  40. [41]

    Rsunivlm: A unified vision language model for remote sensing via granularity-oriented mixture of experts.arXiv preprint arXiv:2412.05679, 2024

    Xu Liu and Zhouhui Lian. Rsunivlm: A unified vision language model for remote sensing via granularity-oriented mixture of experts.arXiv preprint arXiv:2412.05679, 2024

  41. [42]

    Skymoe: A vision-language foundation model for enhancing geospatial interpretation with mixture of experts.arXiv preprint arXiv:2512.02517, 2025

    Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, and Bo Yang. Skymoe: A vision-language foundation model for enhancing geospatial interpretation with mixture of experts.arXiv preprint arXiv:2512.02517, 2025

  42. [43]

    Remote sensing image super-resolution and object detection: Benchmark and state of the art.Expert Systems with Applications, 197:116793, 2022

    Yi Wang, Syed Muhammad Arsalan Bashir, Mahrukh Khan, Qudrat Ullah, Rui Wang, Yilin Song, Zhe Guo, and Yilong Niu. Remote sensing image super-resolution and object detection: Benchmark and state of the art.Expert Systems with Applications, 197:116793, 2022

  43. [44]

    Rcst: Residual context-sharing transformer cascade to approximate taylor expansion for remote sensing image denoising

    Zhenghua Huang, Yang Yang, Hao Yu, Qian Li, Yu Shi, Yaozong Zhang, and Hao Fang. Rcst: Residual context-sharing transformer cascade to approximate taylor expansion for remote sensing image denoising. IEEE Transactions on Geoscience and Remote Sensing, 63:1–15, 2025

  44. [45]

    Jointly rs image deblurring and super-resolution with adjustable-kernel and multi-domain attention

    Yan Zhang, Pengcheng Zheng, Chengxiao Zeng, Bin Xiao, Zhenghao Li, and Xinbo Gao. Jointly rs image deblurring and super-resolution with adjustable-kernel and multi-domain attention. IEEE Transactions on Geoscience and Remote Sensing, 2024

  45. [46]

    Remote sensing image compression: A review

    Shichao Zhou, Chenwei Deng, Baojun Zhao, Yatong Xia, Qisheng Li, and Zhenzhong Chen. Remote sensing image compression: A review. In2015 IEEE International conference on multimedia big data, pages 406–410. IEEE, 2015

  46. [47]

    Image inpainting and digital camouflage: Methods, applications, and perspectives for remote sensing

    Kinga Karwowska, Damian Wierzbicki, and Michal Kedzierski. Image inpainting and digital camouflage: Methods, applications, and perspectives for remote sensing. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

  47. [48]

    Google earth engine: Planetary-scale geospatial analysis for everyone.Remote sensing of Environment, 202:18–27, 2017

    Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. Google earth engine: Planetary-scale geospatial analysis for everyone.Remote sensing of Environment, 202:18–27, 2017

  48. [49]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  49. [50]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  50. [51]

    A survey on llm-as-a-judge.The Innovation, 2024

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

  51. [52]

    Prometheus 2: An open source language model specialized in evaluating other language models

    Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, 2024

  52. [53]

    Ovis2.5 technical report

    Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025

  53. [54]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  54. [55]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025

  55. [56]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  56. [57]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  57. [58]

    Teochat: A large vision-language assistant for temporal earth observation data,

    Jeremy Andrew Irvin, Emily Ruoyu Liu, Joyce Chuyi Chen, Ines Dormoy, Jinyoung Kim, Samar Khanna, Zhuo Zheng, and Stefano Ermon. Teochat: A large vision-language assistant for temporal earth observation data.arXiv preprint arXiv:2410.06234, 2024

  58. [59]

    Lhrs-bot-nova: Improved multimodal large language model for remote sensing vision-language interpretation.ISPRS Journal of Photogrammetry and Remote Sensing, 227:539–550, 2025

    Zhenshi Li, Dilxat Muhtar, Feng Gu, Yanglangxing He, Xueliang Zhang, Pengfeng Xiao, Guangjun He, and Xiaoxiang Zhu. Lhrs-bot-nova: Improved multimodal large language model for remote sensing vision-language interpretation.ISPRS Journal of Photogrammetry and Remote Sensing, 227:539–550, 2025

  59. [60]

    Ovis: Structural embedding alignment for multimodal large language model

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024

  60. [61]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  61. [62]

    Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

  62. [63]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  63. [64]

    A statistical evaluation of recent full reference image quality assessment algorithms.IEEE Trans

    Hamid R Sheikh, Muhammad F Sabir, Alan C Bovik, et al. A statistical evaluation of recent full reference image quality assessment algorithms.IEEE Trans. Image Process., 15(11):3440–3451, 2006

  64. [65]

    Image database tid2013: Peculiarities, results and perspectives.Signal processing: Image communication, 30: 57–77, 2015

    Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, Jaakko Astola, Benoit V ozel, Kacem Chehdi, Marco Carli, Federica Battisti, et al. Image database tid2013: Peculiarities, results and perspectives.Signal processing: Image communication, 30: 57–77, 2015

  65. [66]

    Kadid-10k: A large-scale artificially distorted iqa database

    Hanhe Lin, Vlad Hosu, and Dietmar Saupe. Kadid-10k: A large-scale artificially distorted iqa database. In2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3. IEEE, 2019

  66. [67]

    Q-bench: A benchmark for general-purpose foundation models on low-level vision

    Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision.arXiv preprint arXiv:2309.14181, 2023

  67. [68]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023
