UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

Beidi Zhao; Chen Zhou; Gang Wang; Gexin Huang; Jun Zhou; Myeongkyun Kang; Xiaoxiao Li; Yanting Yang; Zu-Hua Gao

arxiv: 2606.05576 · v1 · pith:562K3HRRnew · submitted 2026-06-04 · 💻 cs.CV

Pith reviewed 2026-06-28 02:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · ultra-resolution images · visual question answering · evidence-grounded reasoning · diagnostic benchmark · multi-stage reasoning · CCTV surveillance · remote sensing

The pith

Current vision-language models fail primarily at evidence grounding and local perception when reasoning over ultra-resolution images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UltraVR, a benchmark for testing vision-language models on visual question answering with ultra-resolution images where key evidence can be tiny, distant, or distributed across scenes. It covers four domains—CCTV surveillance, remote sensing, whole-slide pathology images, and industrial anomaly detection—each presenting distinct challenges like fine-grained grounding or multi-scale navigation. Every instance includes not just a question and answer but a structured chain of thought broken into five labeled stages: evidence grounding, local perception, quantification, evidence integration, and decision inference. This structure moves beyond final-answer scores to diagnose exactly where the visual-to-reasoning pipeline breaks. A sympathetic reader would care because the evaluations show models often recover on later steps once visual facts are supplied, indicating the core difficulty lies in acquiring those facts from the image itself.

Core claim

UltraVR is a diagnostic benchmark spanning four domains with structured ground-truth chains of thought that decompose reasoning into five stages, enabling process-level diagnosis. Frontier VLMs evaluated on it remain far from reliable on ultra-resolution reasoning, with errors concentrating in evidence grounding and local perception while downstream inference often recovers when intermediate visual facts are supplied.

What carries the argument

Structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels that decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference.
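One way to picture the annotation format is as a small record type. The sketch below is an illustration only: the class names, field names, and the example question are hypothetical, and only the five stage labels and the listed components (final question, answer options, step-level questions, intermediate answers, stage labels) come from the paper's description.

```python
from dataclasses import dataclass, field
from typing import List

# The five stage labels named in the abstract (identifier spelling is an assumption).
STAGES = (
    "evidence_grounding",
    "local_perception",
    "quantification",
    "evidence_integration",
    "decision_inference",
)


@dataclass
class Step:
    question: str  # step-level intermediate question
    answer: str    # verified intermediate answer
    stage: str     # one of STAGES

    def __post_init__(self):
        # Reject labels outside the five-stage taxonomy.
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage}")


@dataclass
class Instance:
    domain: str          # e.g. "CCTV", "RS", "WSI", "AD"
    question: str        # final question
    options: List[str]   # answer options
    answer: str          # final answer
    steps: List[Step] = field(default_factory=list)

    def stages_used(self) -> List[str]:
        """Return the stage labels in the order the steps appear."""
        return [s.stage for s in self.steps]


# Hypothetical instance, invented for illustration; not taken from the benchmark.
example = Instance(
    domain="CCTV",
    question="Are more people standing near the left gate than the right gate?",
    options=["Yes", "No"],
    answer="Yes",
    steps=[
        Step("Where is the left gate?", "top-left region", "evidence_grounding"),
        Step("How many people stand near it?", "4", "quantification"),
        Step("Is that more than at the right gate?", "Yes", "decision_inference"),
    ],
)
print(example.stages_used())
```

Validating the stage label at construction time keeps any later per-stage statistics from silently splitting across misspelled labels.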

If this is right

Errors concentrate in evidence grounding and local perception rather than later inference stages.
Downstream decision inference often succeeds when intermediate visual facts are supplied as input.
The benchmark enables localization of failures across the visual-to-decision pipeline instead of relying on final-answer accuracy alone.
The four domains highlight complementary challenges including fine-grained object grounding, long-range spatial comparison, multi-scale evidence navigation, and subtle irregularity detection.
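The claim that errors concentrate in early stages can be checked by locating the first incorrect step in each scored chain. The following is a minimal sketch, assuming each model run is available as an ordered list of (stage label, is_correct) pairs; that input format and the function names are assumptions, not the paper's evaluation code.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple

ScoredStep = Tuple[str, bool]  # (stage label, whether the step answer was correct)


def first_error_stage(steps: Sequence[ScoredStep]) -> Optional[str]:
    """Return the stage of the first incorrect step, or None if every step is correct."""
    for stage, ok in steps:
        if not ok:
            return stage
    return None


def first_error_histogram(runs: Iterable[Sequence[ScoredStep]]) -> Counter:
    """Count, over all runs, which stage the first error falls in."""
    counts: Counter = Counter()
    for steps in runs:
        stage = first_error_stage(steps)
        if stage is not None:
            counts[stage] += 1
    return counts


# Toy data, invented for illustration.
runs = [
    [("evidence_grounding", False), ("quantification", False)],
    [("evidence_grounding", True), ("local_perception", False)],
    [("evidence_grounding", True), ("decision_inference", True)],
]
hist = first_error_histogram(runs)
print(hist)
```

Counting only the first error avoids attributing downstream mistakes to later stages when they were caused by an earlier failure.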

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectural improvements focused on low-level visual feature extraction could reduce many of the observed errors without changing higher-level reasoning components.
The five-stage annotation format could be adapted to create similar diagnostic tests for other multimodal tasks involving high-detail inputs.
Testing whether models trained with explicit grounding objectives close the gap on UltraVR would directly probe the paper's localization of failures.

Load-bearing premise

The four selected domains and the five-stage decomposition of reasoning accurately capture the core challenges of ultra-resolution visual reasoning and provide a valid diagnostic lens for model failures.

What would settle it

A frontier VLM achieving high accuracy on the evidence-grounding and local-perception sub-questions across UltraVR instances would show that failures do not concentrate in those early stages.

Figures

Figures reproduced from arXiv: 2606.05576 by Beidi Zhao, Chen Zhou, Gang Wang, Gexin Huang, Jun Zhou, Myeongkyun Kang, Xiaoxiao Li, Yanting Yang, Zu-Hua Gao.

**Figure 1.** Overview and construction pipeline of UltraVR. UltraVR converts ultra-resolution images from four domains into evidence-grounded QA instances with structured GT-CoT annotations. Each instance contains a final question, answer options, step-level intermediate questions, verified step answers, and operation labels. The construction pipeline extracts domain-specific evidence candidates, instantiates reasoning…

**Figure 2.** Operation-level evaluation with GT-CoT. (a) Relative operation difficulty under S2. S2 completes the GT-CoT schema in a single response; each bar reports an operation’s error deviation from that model’s mean step error. (b) Operation recoverability under S5. Previous intermediate answers are replaced with ground truth, isolating the current operation from earlier accumulated errors. (c) First-error locatio…
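The quantity in panel (a), an operation's error deviation from the model's mean step error, can be sketched as follows. This assumes the mean step error is the micro-average over all scored steps of one model; the caption does not say whether the paper averages over steps or over operations, so that choice and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def operation_error_deviation(steps: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-operation error rate minus the mean error over all steps (micro-average).

    steps: (operation label, is_correct) pairs for a single model.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for op, ok in steps:
        totals[op] += 1
        if not ok:
            errors[op] += 1
    n = sum(totals.values())
    if n == 0:
        return {}
    mean_err = sum(errors.values()) / n
    # Positive values mark operations that are harder than the model's average step.
    return {op: errors[op] / totals[op] - mean_err for op in totals}


# Toy data, invented for illustration.
scored = [
    ("evidence_grounding", False),
    ("evidence_grounding", False),
    ("decision_inference", True),
    ("decision_inference", False),
]
deviation = operation_error_deviation(scored)
print(deviation)
```

Subtracting the model's own mean makes the bars comparable across models with different overall accuracy, which is presumably why the figure reports a deviation and not a raw error rate.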

Abstract

Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.
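The finding that downstream inference recovers "when intermediate visual facts are supplied" implies an evaluation mode in which earlier step answers are replaced by ground truth. A minimal sketch of building such a prompt is given below; the prompt wording, the function name, and the (question, answer) tuple layout are all hypothetical, since the exact protocol is not given in this text.

```python
from typing import List, Tuple


def teacher_forced_prompt(final_question: str, steps: List[Tuple[str, str]], k: int) -> str:
    """Build the prompt for step k, supplying ground-truth answers for steps 0..k-1.

    steps: ordered (question, ground_truth_answer) pairs; k is zero-based.
    """
    if not 0 <= k < len(steps):
        raise IndexError("k must index an existing step")
    lines = [f"Final question: {final_question}", "Verified facts so far:"]
    for i, (q, a) in enumerate(steps[:k], start=1):
        lines.append(f"{i}. {q} -> {a}")
    if k == 0:
        lines.append("(none)")
    # Only the question of step k is shown; its answer is withheld.
    lines.append(f"Now answer: {steps[k][0]}")
    return "\n".join(lines)


# Toy chain, invented for illustration.
chain = [("Where is the gate?", "top-left"), ("How many people are near it?", "4")]
prompt = teacher_forced_prompt("Is the gate area crowded?", chain, 1)
print(prompt)
```

Comparing a model's accuracy on step k with and without the supplied facts separates errors made at that step from errors inherited from earlier ones.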

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UltraVR adds structured diagnostic labels to ultra-res VQA in four domains, but the main failure-localization claim rests on an unvalidated five-stage split and no visible metrics.

UltraVR introduces a benchmark with structured ground-truth chains of thought for ultra-resolution images across CCTV, remote sensing, pathology, and industrial anomaly detection. The five-stage breakdown (evidence grounding, local perception, quantification, evidence integration, decision inference) is the main new piece; it aims to show where models actually fail instead of just reporting final accuracy.

The structured annotations are a practical step. Being able to supply intermediate facts and watch downstream inference recover is the kind of test that could point developers toward fixes in perception rather than reasoning modules. The domains are chosen for real stakes, which makes the setup more relevant than generic VQA sets.

The soft spots are straightforward. The abstract supplies no dataset statistics, no inter-annotator numbers, no model scores, and no check on whether the stage boundaries are reproducible or natural rather than imposed. The stress-test concern lands: if the decomposition is mostly an annotation artifact, then the headline result that errors concentrate early could be circular. The paper would need to show that different annotators agree on the stages and that ablating the taxonomy does not change the localization pattern.

This is for groups that build or audit VLMs and want process-level diagnostics. It deserves peer review because the benchmark framing is concrete and the diagnostic intent is clear, even though the current evidence is thin on validation. A referee can ask for the missing agreement scores and any tests of the stage scheme before deciding how much weight to give the failure-localization findings.
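For the agreement scores the letter asks about, one standard choice is Cohen's kappa over the stage labels assigned by two annotators to the same steps. The sketch below is a plain implementation of that statistic; it is not something the paper is known to report, and the toy labels are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy stage labels from two hypothetical annotators.
ann1 = ["grounding", "grounding", "perception", "perception"]
ann2 = ["grounding", "grounding", "perception", "grounding"]
kappa = cohens_kappa(ann1, ann2)
print(kappa)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because a taxonomy with only five labels makes chance agreement non-trivial.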

Referee Report

2 major / 0 minor

Summary. The paper introduces UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning on ultra-resolution images spanning four domains (CCTV surveillance, remote sensing, whole-slide pathology, and industrial anomaly detection). Each instance includes structured ground-truth chain-of-thought annotations that decompose reasoning into five stages (evidence grounding, local perception, quantification, evidence integration, decision inference). Evaluation of frontier VLMs shows they remain unreliable, with failures localized primarily to evidence grounding and local perception stages while downstream inference recovers when intermediate facts are supplied.

Significance. If the structured annotations prove reliable and the five-stage decomposition is validated as a faithful partition of the reasoning process, UltraVR could serve as a useful diagnostic testbed that moves VLM evaluation beyond aggregate accuracy to process-level failure localization in high-resolution settings.

major comments (2)

[Abstract] The headline diagnostic result (errors concentrate in evidence grounding and local perception) rests entirely on the validity of the five-stage decomposition and the four chosen domains as an accurate, non-overlapping partition of ultra-resolution reasoning; no inter-annotator agreement, ablation of the taxonomy, or external validation of stage boundaries is referenced, so the localization may be an artifact of the annotation scheme rather than a property of model behavior.

[Abstract] Central claims about model performance and failure localization are stated without any quantitative metrics, dataset statistics, annotation validation procedure, or inter-annotator agreement figures, leaving the empirical support for the reported findings unspecified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding our taxonomy validation and empirical support. We address each comment below and will revise the abstract accordingly in the next version.

Referee: [Abstract] The headline diagnostic result (errors concentrate in evidence grounding and local perception) rests entirely on the validity of the five-stage decomposition and the four chosen domains as an accurate, non-overlapping partition of ultra-resolution reasoning; no inter-annotator agreement, ablation of the taxonomy, or external validation of stage boundaries is referenced, so the localization may be an artifact of the annotation scheme rather than a property of model behavior.

Authors: The five-stage decomposition is derived from established models of visual reasoning in the literature and was iteratively refined during annotation to ensure the stages are sequential and non-overlapping, as described in Section 3 of the manuscript. The four domains were selected precisely because they present distinct ultra-resolution challenges that map to different stages. However, we agree that the abstract should explicitly reference the annotation validation procedure. We will revise the abstract to include a brief statement on the inter-annotator agreement achieved and the results of our internal ablation on stage boundaries (reported in the supplementary material), which supports that the observed failure localization is not an artifact of the scheme.

Revision: yes

Referee: [Abstract] Central claims about model performance and failure localization are stated without any quantitative metrics, dataset statistics, annotation validation procedure, or inter-annotator agreement figures, leaving the empirical support for the reported findings unspecified.

Authors: The provided abstract is a high-level summary and therefore omits specific numbers, which is common practice. The full manuscript contains the requested details (dataset size, image resolutions, model accuracies, and annotation statistics) in Sections 4 and 5. To address the concern, we will revise the abstract to incorporate the key quantitative figures (e.g., total instances, primary accuracy ranges, and annotation agreement) so that the central claims are immediately supported by evidence.

Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction with empirical evaluation only

The paper introduces UltraVR as a new benchmark with structured annotations decomposing reasoning into five stages and reports empirical findings on model failures. No mathematical derivations, equations, parameter fitting, or predictions are described anywhere in the provided text. The five-stage decomposition is presented as an annotation scheme for diagnosis rather than derived from prior results or self-citations. No load-bearing claims reduce to self-definition, fitted inputs, or author-specific uniqueness theorems. The work is self-contained as a dataset and evaluation effort without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The contribution rests on the domain assumption that the chosen scenarios and reasoning-stage labels constitute a faithful decomposition of ultra-resolution visual reasoning; no free parameters or invented physical entities are present.

axioms (1)

domain assumption: The five-stage decomposition (evidence grounding, local perception, quantification, evidence integration, decision inference) validly breaks down the reasoning process for diagnostic purposes.
Invoked when describing how the structured annotations enable process-level diagnosis.

invented entities (1)

UltraVR benchmark with structured CoT labels (no independent evidence)
Purpose: diagnostic testbed for localizing VLM failures on ultra-resolution images.
Newly constructed artifact introduced in this work.

[1] [1]

Urbanllava: A multi-modal large language model for urban intelligence

Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, and Yong Li. Urbanllava: A multi-modal large language model for urban intelligence. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6209–6219, 2025

2025

[2] [2]

Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

2024

[3] [3]

Lu, Bowen Chen, Drew F

Ming Y . Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, Anil V . Parwani, Andrew Zhang, and Faisal Mahmood. A visual-language foundation model for computational pathology.Nature Medicine, 2024

2024

[4] [4]

Winclip: Zero-/few-shot anomaly classification and segmentation

Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. Winclip: Zero-/few-shot anomaly classification and segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023

2023

[5] [5]

Qwen3.5: Towards native multimodal agents

Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id= qwen3.5, 2026. Accessed: 2026-05-06

2026

[6] [6]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Gemma 4 model card

Google. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_ card_4, 2026. Accessed: 2026-05-06

2026

[8] [8]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Qi Ji, Junhui Ji, et al. GLM-4.1V-Thinking and GLM-4.5V: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.

[10] Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025.

[11] Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, and Eduardo Salinas. Phi-4-reasoning-vision-15B technical report. arXiv preprint arXiv:2603.03975, 2026.

[12] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.

[13] Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps. arXiv preprint arXiv:2505.18675, 2025.

[14] Yifan Jiang, Cong Zhang, Bofei Zhang, Yifan Yang, Bingzhang Wang, and Yew-Soon Ong. From pixels to facts: Benchmarking multi-hop reasoning for fine-grained visual fact checking. arXiv preprint arXiv:2602.00593, 2026.

[15] Puyin Li, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-Fei, and Ehsan Adeli. Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models. arXiv preprint arXiv:2512.19526, 2025.

[16] Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, and Qinghao Ye. When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought. arXiv preprint arXiv:2511.02779, 2025.

[17] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[18] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. arXiv preprint arXiv:2408.15556, 2024.

[19] Tao Ma, Bing Bai, Haozhe Lin, Heyuan Wang, Yu Wang, Lin Luo, and Lu Fang. When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[20] Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, Hao Zhou, Zhuoyang Zou, Shu Zhao, Sarkar Snigdha Sarathi Das, Vipul Gupta, Xiaoxin Lu, Nan Zhang, Ranran Haoran Zhang, Avitej Iyer, Renze Lou, Wenpeng Yin, and Rui Zhang. Hrscene: How far are vlms from effective high-resolution image understanding? In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

[21] Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, and Maosong Sun. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[22] Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, and Yang Gao. A benchmark for ultra-high-resolution remote sensing mllms. arXiv preprint arXiv:2512.17319, 2025.

[23] Siqi Li, Xinyu Cai, Jianbiao Mei, Nianchen Deng, Pinlong Cai, Licheng Wen, Yufan Shen, Xuemeng Yang, Botian Shi, and Yong Liu. Ur-bench: A benchmark for multi-hop reasoning over ultra-high-resolution images. arXiv preprint arXiv:2601.08748, 2026.

[24] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[25] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.

[26] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI ...

[27] Xueyang Wang, Xiya Zhang, Yinheng Zhu, Yuchen Guo, Xiaoyun Yuan, Liuyu Xiang, Zerun Wang, Guiguang Ding, David J. Brady, Qionghai Dai, and Lu Fang. Panda: A gigapixel-level human-centric video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3268–3278, 2020.

[28] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983, 2018.

[29] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013.

[30] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization. International Journal of Computer Vision, 2022.

[31] OpenAI. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/

[32] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-06.

[34] Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-05-06.

Appendix Contents

A Detailed Benchmark Construction

A.1 CCTV Surveillance

A.2 Remote Sensing ...

(Answer only if Q2 = Yes; otherwise N/A.) → [4, 5, 7, 8, 10, 13, 14, 16, 20, 21, 22, 23, 24, 26, 27, 28] [multi_select_int_list]

Q3: For the closer review, what practical starting magnification should a junior pathologist use first? (Single-choice. Answer only if Q2 = Yes; otherwise N/A.) → 10x [single_choice]

Q4: After zooming in, which fine-grained morphologic features should be checked? (Multi-select, up to 5 most relevant ...

(Answer only if Q2 = Yes; otherwise N/A.) → [10, 11, 15, 16, 17, 21, 22, 23, 28, 29] [multi_select_int_list]

Q3: For the closer review, what practical starting magnification should a junior pathologist use first? (Single-choice. Answer only if Q2 = Yes; otherwise N/A.) → 10x [single_choice]

Q4: After zooming in, which fine-grained morphologic features should be checked? (Multi-select, up to 5 most relevant ...

(Answer only if Q2 = Yes; otherwise N/A.) → [2, 3, 4, 10, 11, 26, 27, 28, 34] [multi_select_int_list]

Q3: For the closer review, what practical starting magnification should a junior pathologist use first? (Single-choice. Answer only if Q2 = Yes; otherwise N/A.) → 20x [single_choice]

Q4: After zooming in, which fine-grained morphologic features should be ...

**Logical Verification**

**Relational Inference**

Each template should satisfy the following general principles:

### General Principle 1: Prefer same-level variables

Variables used in one question should belong to the same semantic level whenever possible. Avoid mixing diagnosis type, growth pattern, vascular spread, and unrelated morphology in a single option set.

### General...