UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
Pith reviewed 2026-05-10 13:50 UTC · model grok-4.3
The pith
UHR-BAT selects visual tokens from ultra-high-resolution remote sensing images using text-guided multi-scale importance to stay inside a fixed token budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UHR-BAT is a query-guided, region-faithful token compression framework: it scores visual tokens with text-guided multi-scale importance estimation, applies region-wise preserve and merge strategies to reduce redundancy, and thereby selects a budgeted set of tokens that still supports state-of-the-art performance on ultra-high-resolution remote sensing benchmarks.
What carries the argument
Text-guided multi-scale importance estimation combined with region-wise preserve and merge strategies that decide which visual tokens to keep or combine under a fixed context budget.
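Read mechanically, that preserve-and-merge pipeline can be sketched in a few lines. The cosine scoring rule, the uniform region partition, and the one-merged-token-per-region choice below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def compress_tokens(tokens, query_emb, n_regions, keep_per_region):
    """Budget-aware compression sketch: score tokens against the text query,
    preserve the top-k per region, and merge each region's remainder into a
    single averaged token. Names and choices are illustrative, not the
    paper's exact method."""
    n, _ = tokens.shape
    # Text-guided importance: cosine similarity to the query embedding.
    scores = tokens @ query_emb / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    kept, merged = [], []
    for region in np.array_split(np.arange(n), n_regions):
        order = region[np.argsort(scores[region])[::-1]]
        kept.extend(order[:keep_per_region].tolist())   # region-wise preserve
        rest = order[keep_per_region:]
        if len(rest):
            merged.append(tokens[rest].mean(axis=0))    # region-wise merge
    parts = [tokens[kept]] + ([np.stack(merged)] if merged else [])
    # Output size is at most n_regions * (keep_per_region + 1): a fixed budget.
    return np.concatenate(parts, axis=0)

# Usage: 1024 tokens under a budget of 4 regions x (8 preserved + 1 merged).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 64))
query_emb = rng.normal(size=64)
out = compress_tokens(tokens, query_emb, n_regions=4, keep_per_region=8)
print(out.shape)  # (36, 64)
```

The point of the sketch is the budget arithmetic: the output length is bounded by the region count and per-region quota regardless of input resolution, which is what distinguishes this family of methods from dense tiling.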
If this is right
- The same compression pipeline can be applied to any vision-language task where image size produces too many tokens for the available context window.
- Region-wise merging reduces token count more efficiently than uniform global top-k selection while attempting to keep spatial relationships intact.
- Multi-scale scoring allows the model to consider both large-scale scene layout and fine local details in one pass without separate tiling steps.
- Performance gains appear across multiple remote sensing benchmarks when the token budget is held constant.
Where Pith is reading between the lines
- The same importance-plus-merge logic could be tested on other high-resolution domains such as medical whole-slide images or aerial photography of urban areas.
- If the merge operations preserve enough local structure, the approach might reduce the need for separate object detectors before feeding images to a vision-language model.
- Extending the multi-scale estimation to include explicit scale weighting based on query type could further reduce cases where tiny but decisive objects are overlooked.
Load-bearing premise
Text-guided multi-scale importance estimation can reliably locate and preserve every piece of query-critical evidence, including objects only a few pixels wide, without systematic omission caused by the region-wise merge steps.
What would settle it
A controlled test in which a model given the full uncompressed image correctly answers a query about a few-pixel object, yet the same model given UHR-BAT's compressed tokens answers incorrectly because the object was dropped during importance scoring or merging.
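That falsification test can be operationalized as a toy harness: replace the model with an oracle that answers correctly iff the query-critical token survives into its visual input. `answers_correctly`, `settle_test`, and the lossy compressor are hypothetical stand-ins, not anything from the paper:

```python
def answers_correctly(token_ids, evidence_id):
    # Proxy oracle: the "model" answers correctly iff the query-critical
    # token survived into its input. Purely illustrative.
    return evidence_id in set(token_ids)

def settle_test(n_tokens, evidence_id, compress):
    """Return True exactly in the decisive failure case: the full token set
    supports a correct answer but the compressed set does not."""
    full = list(range(n_tokens))
    compressed = compress(full)
    full_ok = answers_correctly(full, evidence_id)
    comp_ok = answers_correctly(compressed, evidence_id)
    return full_ok and not comp_ok

# A deliberately lossy compressor that keeps only even-indexed tokens.
lossy = lambda ids: ids[::2]
print(settle_test(100, evidence_id=7, compress=lossy))  # True: evidence dropped
print(settle_test(100, evidence_id=8, compress=lossy))  # False: evidence kept
```

A real version of this test would swap the oracle for the actual VLM and the index set for UHR-BAT's selected tokens, but the pass/fail logic is the same.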
Original abstract
Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UHR-BAT, a query-guided token compression framework for vision-language models on ultra-high-resolution remote sensing imagery. It uses text-guided multi-scale importance estimation of visual tokens together with region-wise preserve (top-k per region) and merge (average/concatenate) operations to select tokens under a fixed context budget, and claims state-of-the-art performance on various benchmarks.
Significance. If the performance claims are substantiated, the work addresses a practically important scaling problem in remote-sensing VLMs where kilometer-scale context coexists with query-critical evidence that may occupy only a few pixels. A reliable budget-aware compressor that preserves small-object signals would be a useful engineering contribution.
Major comments (2)
- [Abstract] The claim of state-of-the-art performance is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis, making it impossible to judge whether the data support the central claim.
- [Method] Method (text-guided multi-scale importance estimation and region-wise preserve/merge): importance scores are computed on downsampled or pooled features at each scale; any object smaller than the coarsest pooling kernel therefore receives diluted scores. The subsequent per-region top-k preserve followed by merge can further discard or average away the already-weak signal. No explicit high-resolution saliency pass or pixel-level recovery mechanism is described. This directly threatens the central claim that query-critical evidence occupying only a few pixels is reliably preserved.
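The dilution concern in the second comment is easy to make concrete: a single-pixel saliency peak pooled with a k×k average kernel loses a factor of k² before any selection happens. This is a minimal numeric sketch of the referee's argument, not the paper's actual pooling scheme:

```python
import numpy as np

# Signal dilution under average pooling: a lone high-importance pixel
# inside a k x k pooling window has its score divided by k^2 before
# any top-k selection or merging sees it.
saliency = np.zeros((32, 32))
saliency[5, 9] = 1.0  # a "few-pixel" query-critical object

def avg_pool(x, k):
    # Non-overlapping k x k block average via reshape (h, w divisible by k).
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

for k in (2, 4, 8):
    print(k, avg_pool(saliency, k).max())  # peak score falls as 1 / k^2
```

At the coarsest plausible scale (k = 8) the peak is already down to 1/64 of its original value, which is why the comment asks how such a score survives a subsequent per-region top-k.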
Minor comments (1)
- [Abstract] The abstract would benefit from a concise statement of the key quantitative gains (e.g., accuracy delta and token reduction factor) to allow readers to assess the SOTA claim at a glance.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We have carefully addressed each major comment below and revised the manuscript to strengthen the presentation of our results and clarify the method's handling of small objects.
Point-by-point responses
-
Referee: [Abstract] The claim of state-of-the-art performance is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis, making it impossible to judge whether the data support the central claim.
Authors: We agree that the abstract would be strengthened by including key quantitative highlights. In the revised manuscript, we have updated the abstract to briefly report specific performance gains (e.g., average improvements over strong baselines on the primary remote-sensing VLM benchmarks) while remaining within length limits. The full set of quantitative comparisons, ablation studies, and error analyses already appears in Sections 4 and 5; the abstract revision now directs readers to these results more explicitly. revision: yes
-
Referee: [Method] Method (text-guided multi-scale importance estimation and region-wise preserve/merge): importance scores are computed on downsampled or pooled features at each scale; any object smaller than the coarsest pooling kernel therefore receives diluted scores. The subsequent per-region top-k preserve followed by merge can further discard or average away the already-weak signal. No explicit high-resolution saliency pass or pixel-level recovery mechanism is described. This directly threatens the central claim that query-critical evidence occupying only a few pixels is reliably preserved.
Authors: We appreciate this careful analysis of potential signal dilution for sub-kernel objects. Our multi-scale estimation explicitly includes a finest scale whose pooling kernel is sized to retain few-pixel features; importance scores at this scale are computed directly on high-resolution patches. The text-guided cross-attention then amplifies query-relevant tokens at every scale, including the fine one, before region-wise top-k selection. The per-region preserve step further guarantees that localized high-importance tokens (even isolated small-object signals) are retained rather than globally pruned. We have added a dedicated paragraph in Section 3.2 clarifying the scale-specific kernel sizes and the role of query guidance, together with new qualitative visualizations in Section 4.3 that demonstrate preservation of few-pixel targets on UHR remote-sensing examples. While we did not introduce a separate pixel-level saliency branch (to preserve the low-cost design), the current mechanism supports the reported benchmark results on datasets containing small objects. revision: partial
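The retention claim in this response can be illustrated in miniature: under an equal budget, a locally maximal but globally modest score survives per-region top-k where global top-k drops it. The scores and region boundaries below are invented for illustration:

```python
import numpy as np

# Per-region preserve vs. global top-k under the same total budget.
# Region A is a busy, high-salience area; region B contains one small
# object whose (diluted) score of 0.4 is locally maximal but lower than
# every score in region A.
scores = np.array([0.90, 0.80, 0.70, 0.60,    # region A
                   0.40, 0.10, 0.08, 0.05])   # region B
budget = 4

# Global top-k spends the whole budget on region A.
global_keep = set(np.argsort(scores)[::-1][:budget].tolist())

# Region-wise top-k gives each region an equal quota.
regionwise_keep = set()
for region in (range(0, 4), range(4, 8)):
    idx = np.array(list(region))
    top = idx[np.argsort(scores[idx])[::-1][:budget // 2]]
    regionwise_keep.update(top.tolist())

print(4 in global_keep)      # False: the small object is pruned
print(4 in regionwise_keep)  # True: its region's quota preserves it
```

This shows what the per-region quota buys; it does not, by itself, address the referee's prior step, where pooling may dilute the small object's score before it even becomes its region's maximum.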
Circularity Check
No significant circularity; independent engineering method validated by experiments
Full rationale
The paper presents UHR-BAT as a query-guided and region-faithful token compression framework relying on text-guided multi-scale importance estimation plus region-wise preserve and merge strategies. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the text that reduce the claimed SOTA performance or token selection to inputs by construction. The contribution is described as an independent engineering approach whose validity rests on benchmark experiments rather than any self-referential loop or ansatz smuggled via prior work.