arxiv: 2605.27243 · v1 · pith:F4QJDZE4new · submitted 2026-05-26 · 💻 cs.CV

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

Pith reviewed 2026-06-29 18:27 UTC · model grok-4.3

keywords: multimodal retrieval heads · vision-language models · long-context modeling · attention heads · evidence retrieval · document understanding · image retrieval

The pith

Sparse attention heads handle most retrieval of both text and images in long-context vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a detection method that scores attention from question tokens to evidence tokens appearing in either text or images. It establishes that multimodal retrieval heads are sparse: 4.4-10.2 percent of all heads carry half the positive retrieval-score mass. Masking the top-scoring 5 percent of heads cuts the MMLongBench-Doc score from 48.2 to 5.7 percent and the SlideVQA score from 71.2 to 8.9 percent, whereas masking randomly chosen heads leaves scores of 32.2 and 52.6 percent, respectively. The heads are partly shared across modalities, yet the image retrieval heads change more than the text retrieval heads when context length or haystack modality shifts. The same heads can also rank visually rich documents directly, improving recall metrics without any additional training.

Core claim

Vision-language models contain sparse multimodal retrieval heads that locate relevant evidence across interleaved text and images. These heads are identified by an attention-scoring method from question tokens to textual or visual evidence tokens. They concentrate 50 percent of the positive retrieval-score mass in only 4.4-10.2 percent of heads and prove causally important because masking the top 5 percent collapses performance on long-context benchmarks while random masking does not. The heads are partly shared across modalities but more dynamic for images as context varies, and they support direct document ranking on MMDocIR without further training.

What carries the argument

A multimodal retrieval-head detection method that scores the post-softmax attention from question tokens to textual or visual evidence tokens.
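
The scoring idea can be sketched in a few lines. This is a minimal illustration, not the paper's formula (which is not reproduced on this page): it assumes the per-head score is the mean, over question tokens, of the post-softmax attention mass that lands on evidence tokens. The names `head_retrieval_score` and `score_heads` are hypothetical, and the paper's actual score may include normalization or a baseline term that this sketch omits.

```python
from typing import Dict, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index); a hypothetical convention


def head_retrieval_score(
    attn: Sequence[Sequence[float]],
    question_idx: Sequence[int],
    evidence_idx: Sequence[int],
) -> float:
    """Assumed score: mean over question tokens of attention mass on evidence.

    attn[q][k] is the post-softmax weight from query position q to key
    position k for a single head. Evidence positions may be text tokens
    or image-patch tokens; the sketch treats both the same way.
    """
    if not question_idx:
        raise ValueError("question_idx must not be empty")
    total = 0.0
    for q in question_idx:
        total += sum(attn[q][k] for k in evidence_idx)
    return total / len(question_idx)


def score_heads(
    attn_by_head: Dict[Head, Sequence[Sequence[float]]],
    question_idx: Sequence[int],
    evidence_idx: Sequence[int],
) -> Dict[Head, float]:
    """Apply the assumed score to every head's attention matrix."""
    return {
        head: head_retrieval_score(attn, question_idx, evidence_idx)
        for head, attn in attn_by_head.items()
    }
```

In this form a head that spends most of its question-token attention on the evidence span scores close to 1, and a head that attends elsewhere scores close to 0.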

If this is right

  • Only 4.4-10.2% of heads account for 50% of positive retrieval-score mass.
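  • Masking the top-scoring 5% of heads drops the MMLongBench-Doc score from 48.2% to 5.7%, versus 32.2% under random-head masking.
  • The same masking drops the SlideVQA score from 71.2% to 8.9%, versus 52.6% under random-head masking.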
  • The heads are partly shared across text and image modalities but more dynamic for images.
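  • Without training, selected-head scoring with Qwen3-VL-8B improves Recall@1 on MMDocIR over the strongest reported baseline by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval.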
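
The sparsity statistic in the first bullet can be made concrete with a short sketch. It assumes that "positive retrieval-score mass" means the sum of all positive per-head scores and that the reported percentage is the smallest share of all heads whose scores reach half of that sum; the function name `fraction_of_heads_for_mass` is hypothetical.

```python
from typing import Sequence


def fraction_of_heads_for_mass(scores: Sequence[float], mass: float = 0.5) -> float:
    """Smallest fraction of all heads whose positive scores cover `mass` of the total.

    Only positive scores count toward the mass (an assumption about what
    "positive retrieval-score mass" means); the denominator is every head.
    """
    positive = sorted((s for s in scores if s > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        raise ValueError("no positive retrieval-score mass")
    target = mass * total
    running = 0.0
    for count, score in enumerate(positive, start=1):
        running += score
        if running >= target:
            return count / len(scores)
    # Floating-point fallback: all positive heads were needed.
    return len(positive) / len(scores)
```

Under these assumptions, a result between 0.044 and 0.102 would correspond to the 4.4-10.2% range quoted above.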

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could be pruned or specialized around these heads to reduce compute in long-context settings.
  • The detection approach may apply to other multimodal inputs such as video or audio sequences.
  • Targeted interventions on these heads could debug or enhance retrieval behavior in deployed VLMs.

Load-bearing premise

The attention-scoring method from question tokens to evidence tokens isolates heads that are causally responsible for retrieval performance rather than merely correlated with it.

What would settle it

An experiment in which masking the top 5 percent of selected heads produces no larger performance drop on MMLongBench-Doc or SlideVQA than masking an equal number of random heads would falsify the claim of causal importance.
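
The falsification test above amounts to a matched-size comparison, sketched here under stated assumptions. The helper names are hypothetical, the selection rule (top fraction by score) is an assumption about how the heads are chosen, and the actual masking of heads inside a model is left out because it depends on the model implementation.

```python
import random
from typing import Dict, Iterable, List, Tuple

Head = Tuple[int, int]  # (layer index, head index); a hypothetical convention


def top_fraction_heads(scores: Dict[Head, float], fraction: float = 0.05) -> List[Head]:
    """Heads with the highest retrieval scores, at least one head."""
    k = max(1, round(fraction * len(scores)))
    return sorted(scores, key=scores.get, reverse=True)[:k]


def random_control_heads(all_heads: Iterable[Head], k: int, seed: int) -> List[Head]:
    """An equally sized random set of heads, reproducible for a given seed."""
    rng = random.Random(seed)
    return rng.sample(sorted(all_heads), k)


def selected_drop_exceeds_random(base: float, acc_selected: float, acc_random: float) -> bool:
    """True when masking selected heads hurts more than masking random heads."""
    return (base - acc_selected) > (base - acc_random)
```

With the scores quoted on this page, `selected_drop_exceeds_random(48.2, 5.7, 32.2)` and `selected_drop_exceeds_random(71.2, 8.9, 52.6)` both return `True`; the claim would be in trouble if either returned `False`.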

Figures

Figures reproduced from arXiv: 2605.27243 by Aaron Branson Cigres Li, Ginny Wong, Haobo Li, Haodong Duan, Lishu Luo, Pasquale Minervini, Simon See, Xiyu Ren, Yangqiu Song, Yiming Du, Yu Zhao, Zhaowei Wang.

Figure 1: Removing multimodal retrieval heads causally…
Figure 2: Schematic of MMRetHead detection. For each attention head, we score the post-softmax attention from…
Figure 3: Retrieval-score mass concentration. Each cell…
Figure 4: Top-5% text-retrieval heads in Qwen3-VL-8B.
Figure 5: Context-length sensitivity of retrieval-head se…
Figure 8 (surrounding text as extracted): Figure 8 shows that masking these retrieval heads sharply reduces MMLongBench-Doc score from 48.2% to 5.7% and SlideVQA score from 71.2% to 8.9%. In contrast, masking randomly chosen attention heads is less damaging, leaving scores of 32.2% and 52.6%, respectively. These results show that retrieval heads identified in controlled MM-NIAH tasks remain causally important on Long Document VQA tasks beyond the detection…
Figure 10: Retrieval-task sensitivity of retrieval-head se…
Figure 11: Sensitivity to the haystack image ratio. Each…
Figure 13: Top-5% text-retrieval heads in Gemma3-12B-IT. Blue heads are preserved from Gemma3-12B-PT, while orange heads newly enter the top-5% set after vision-language training.
[Plot residue: accuracy (%) by masking type (decode-only, prefill+decode); series: image needle, text needle; extracted values 42.8, 0.0, 99.8, 2.9]
Figure 12: Decode-only versus prefill-plus-decoding…
Original abstract

Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
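
The document-ranking use in the last sentence of the abstract can be illustrated with a small evaluation sketch. Several things here are assumptions rather than details from the paper: each candidate page is taken to have a single relevance score (for example, attention mass from the selected heads), each query is taken to have exactly one gold page, "micro" is read as pooling all queries, and "macro" as averaging per-group Recall@1 over some grouping such as document or domain. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_pages(page_scores: Dict[str, float]) -> List[str]:
    """Order candidate pages from highest to lowest score."""
    return sorted(page_scores, key=page_scores.get, reverse=True)


def recall_at_1(queries: Iterable[Tuple[str, Dict[str, float], str]]) -> Tuple[float, float]:
    """Return (macro, micro) Recall@1.

    Each query is (group, page_scores, gold_page). With one gold page per
    query (an assumption), Recall@1 is 1.0 if the top-ranked page is gold.
    """
    hits_by_group: Dict[str, List[float]] = defaultdict(list)
    for group, page_scores, gold_page in queries:
        ranked = rank_pages(page_scores)
        hit = 1.0 if ranked and ranked[0] == gold_page else 0.0
        hits_by_group[group].append(hit)
    if not hits_by_group:
        raise ValueError("no queries given")
    all_hits = [h for hits in hits_by_group.values() for h in hits]
    micro = sum(all_hits) / len(all_hits)
    macro = sum(sum(h) / len(h) for h in hits_by_group.values()) / len(hits_by_group)
    return macro, micro
```

Macro and micro diverge whenever groups contain different numbers of queries, which is presumably why the abstract reports both.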

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence tokens in long-context VLMs. It claims these heads are sparse (only 4.4-10.2% of heads account for 50% of positive retrieval-score mass), intrinsic, and causally important, as masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9% while random masking is far less damaging. The heads are partly shared across modalities but dynamic (image heads change more with context length and haystack modality), and the selected heads can be used zero-shot to improve Recall@1 on MMDocIR page and layout retrieval over baselines.

Significance. If the results hold, the work extends retrieval-head findings from LLMs to multimodal long-context settings and identifies a sparse mechanistic component for evidence retrieval across text and images. The differential ablation (top-k vs. random masking) directly tests causality rather than mere correlation. The zero-shot application to document ranking on MMDocIR is a practical strength. Use of public benchmarks and reporting of concrete performance numbers (including the 50% mass sparsity statistic) aids reproducibility.

minor comments (2)
  1. [Method (around the multimodal retrieval head detection description)] The exact attention-scoring formula, including how positive retrieval-score mass is aggregated and normalized across heads and modalities, should be stated explicitly with an equation in the method section to support full reproduction of the sparsity percentages.
  2. [Experiments (ablation results on MMLongBench-Doc and SlideVQA)] The ablation experiments report large performance drops but do not include standard deviations across multiple random seeds or statistical significance tests; adding these would strengthen the claim that the differential effect is robust.
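
The second comment can be illustrated with a short sketch of the requested reporting. It assumes the random-head baseline is rerun with several seeds and that each run yields one accuracy value; the function names are hypothetical, and the gap measure is an informal effect size, not a formal significance test.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_random_masking(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracy across random-mask seeds."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


def gap_in_std_units(acc_selected: float, random_accuracies: Sequence[float]) -> float:
    """How far the selected-head result sits below the random-mask mean, in std units."""
    mu, sigma = summarize_random_masking(random_accuracies)
    if sigma == 0:
        raise ValueError("random-mask accuracies have zero variance")
    return (mu - acc_selected) / sigma
```

A large gap across seeds would support the claim that the differential effect is not an artifact of one lucky random draw.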

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation, accurate summary of our contributions on multimodal retrieval heads, and recommendation for minor revision. The report correctly highlights the sparsity (4.4-10.2% heads for 50% mass), causal importance via differential ablation, partial cross-modal sharing, and zero-shot gains on MMDocIR. No major comments were listed in the report, so we have no points requiring rebuttal or revision at this time.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central claims rest on an introduced attention-scoring procedure applied to existing models, followed by direct empirical measurements of head sparsity (4.4-10.2% heads for 50% mass) and controlled ablation results on public benchmarks (MMLongBench-Doc, SlideVQA, MMDocIR). These quantities are computed from model outputs and dataset performance; no equations or self-citations reduce the reported percentages or drops to quantities defined by the paper's own fitted parameters or prior author results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new detection procedure and the concept of multimodal retrieval heads defined by attention scores; no free parameters are fitted to produce the headline percentages, and no new physical or mathematical entities are postulated.

axioms (1)
  • [standard math] The standard transformer attention mechanism computes relevance between query and key tokens in the usual way.
    Invoked when defining the retrieval score from question tokens to evidence tokens.
invented entities (1)
  • multimodal retrieval head [no independent evidence]
    purpose: Label for attention heads that receive high retrieval scores toward either text or image evidence.
    Defined by the new scoring method; no independent evidence outside the paper's own attention measurements.

pith-pipeline@v0.9.1-grok · 5844 in / 1410 out tokens · 35296 ms · 2026-06-29T18:27:16.633566+00:00 · methodology
