HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Bo Zheng; Jian Xu; Liang Wu; Xianjie Liu; Yiman Hu; Yixiong Zou

arxiv: 2510.00054 · v2 · pith:Y2U5SNI2new · submitted 2025-09-28 · 💻 cs.CV · cs.AI


Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng


keywords decoupling, hide, background, mllms, attention, bench, decouple, framework


Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks, yet their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints, arguing that MLLMs struggle to recognize small objects, and therefore adopt "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but interference from complex backgrounds. We systematically analyze the "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework. HiDe first uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. It then employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstruct a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, respectively, and even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://tennine2077.github.io/HiDe.github.io/.
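
The abstract does not spell out how TAD or LPD are computed, so the following is only a toy illustration of the general idea of a compact, layout-preserving crop, not the authors' method. It assumes a patch-level attention grid is already available, marks high-attention patches as foreground, and drops every patch row and patch column that contains no foreground patch while keeping the remaining ones in their original order. The function names `select_patches` and `compact_layout`, the `keep_ratio` threshold, and the row/column-dropping rule are all hypothetical choices made for this sketch.

```python
import numpy as np


def select_patches(attn, keep_ratio=0.5):
    """Mark patches whose attention is at least keep_ratio * max as foreground.

    Hypothetical thresholding rule, used here only for illustration.
    """
    attn = np.asarray(attn, dtype=float)
    return attn >= keep_ratio * attn.max()


def compact_layout(image, mask, patch):
    """Drop patch rows/columns with no foreground patch, preserving order.

    image: array of shape (H, W) or (H, W, C), with H and W multiples of patch.
    mask:  boolean array of shape (H // patch, W // patch).
    """
    rows = np.flatnonzero(mask.any(axis=1))  # patch rows that contain foreground
    cols = np.flatnonzero(mask.any(axis=0))  # patch columns that contain foreground
    # Expand patch indices into pixel indices.
    row_px = (rows[:, None] * patch + np.arange(patch)).ravel()
    col_px = (cols[:, None] * patch + np.arange(patch)).ravel()
    return image[np.ix_(row_px, col_px)]


# Toy input: a 4x4 patch grid over an 8x8 image (patch size 2).
attn = np.full((4, 4), 0.1)
attn[0, 0] = 0.9  # first salient region (top-left)
attn[3, 2] = 0.8  # second salient region (bottom, right of centre)

image = np.arange(64).reshape(8, 8)
mask = select_patches(attn, keep_ratio=0.5)
compact = compact_layout(image, mask, patch=2)
print(mask.sum(), compact.shape)
```

In this toy case the two salient patches stay in the same relative positions (top-left versus bottom-right) while the empty rows and columns between them are removed. Note that this simple rule still retains some background patches that share a row or column with a foreground patch; a real system would likely need a finer-grained reconstruction.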


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
cs.CV 2026-05 unverdicted novelty 6.0

Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.