
arxiv: 2605.19792 · v1 · pith:FIGDRADZnew · submitted 2026-05-19 · 💻 cs.CV

Mechanisms of Object Localization in Vision-Language Models

Pith reviewed 2026-05-20 06:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · object localization · mechanistic interpretability · attention heads · causal mediation analysis · token ablation · containerization mechanism · bounding box prediction

The pith

Vision-language models localize objects by using specific tokens to mark boundaries, with the arrangement of tokens inside those boundaries having little impact on the box prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the internal processes that allow vision-language models to locate objects in images. Using token ablations, attention knockout, and causal mediation analysis on LLaVA-1.5 and InternVL-3.5, it argues that localization depends on a containerization process in which object-aligned tokens outline the object's spatial extent, while the meaning or order of tokens inside that outline has little influence on the output box. The work also finds that a small number of attention heads handle the key computations for both classification and localization, concentrated in early-mid layers for LLaVA and mid-late layers for InternVL.

Core claim

Localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads.

What carries the argument

Containerization mechanism, where object-aligned tokens define the spatial extent of the object for the localization task.
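
To make the containerization idea concrete, here is a minimal sketch of one intervention that would probe it: permuting the visual tokens inside an object mask while leaving every other position untouched. This is an illustration under stated assumptions, not the paper's procedure. The function names `shuffle_inside_mask` and `container`, the toy 4×4 grid, and the choice of a random permutation are all hypothetical.

```python
import random

def shuffle_inside_mask(tokens, mask, seed=0):
    """Permute the tokens at masked positions; keep all other positions fixed."""
    rng = random.Random(seed)
    inside = [i for i, m in enumerate(mask) if m]
    values = [tokens[i] for i in inside]
    rng.shuffle(values)
    out = list(tokens)
    for i, v in zip(inside, values):
        out[i] = v
    return out

def container(mask, grid_w):
    """Tight box (row_min, col_min, row_max, col_max) of masked grid positions."""
    rows = [i // grid_w for i, m in enumerate(mask) if m]
    cols = [i % grid_w for i, m in enumerate(mask) if m]
    return (min(rows), min(cols), max(rows), max(cols))

# Toy 4x4 token grid (row-major); the "object" covers the central 2x2 block.
grid_w = 4
tokens = [f"t{i}" for i in range(16)]
mask = [i in {5, 6, 9, 10} for i in range(16)]
shuffled = shuffle_inside_mask(tokens, mask, seed=1)
```

The shuffle changes which token sits where inside the object, but the set of object-aligned positions, and therefore `container(mask, grid_w)`, is the same before and after. If the containerization claim holds, a model's predicted box should track that container rather than the internal arrangement.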

If this is right

  • Revealing these narrow computational pathways can guide future model design for better visual grounding.
  • Grounding objectives can be optimized around boundary definition rather than full semantic processing inside objects.
  • Interventions on the small set of specialized heads could simultaneously affect classification and localization performance.
  • The layer differences between models suggest architecture-dependent strategies for improving localization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar containerization might occur in other multimodal models, allowing targeted debugging of localization errors.
  • Models could be made more efficient by focusing training resources on the identified attention heads.
  • Prompt engineering that targets boundary tokens might selectively impair or enhance localization without affecting other capabilities.

Load-bearing premise

Token ablations and attention knockouts isolate the exact causal roles of specific tokens and heads without interference from remaining model components or dependence on the particular images and prompts tested.

What would settle it

Ablating the identified small set of attention heads on a fresh set of images and prompts, then finding that localization performance stays intact or shifts to other heads, would show the claimed mechanism does not hold generally.
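
The test described above could be organized roughly as follows: rank heads by an importance score, ablate the top k for increasing k, and record localization accuracy at each step. This is a sketch only. `cumulative_ablation_curve`, the `evaluate` callback, and the toy scores are hypothetical stand-ins; in practice `evaluate` would run the model on the fresh images and prompts with the given heads disabled.

```python
def cumulative_ablation_curve(head_scores, evaluate, steps):
    """head_scores: {(layer, head): score}; evaluate: callable(set of heads) -> accuracy.

    Returns [(k, accuracy)] after ablating the k highest-scoring heads.
    """
    ranked = sorted(head_scores, key=head_scores.get, reverse=True)
    return [(k, evaluate(set(ranked[:k]))) for k in steps]

# Toy stand-in for a model evaluation: two heads matter, the rest do not.
critical = {(10, 3), (12, 7)}

def toy_evaluate(ablated):
    return 0.8 - 0.3 * len(ablated & critical)

scores = {(10, 3): 0.9, (12, 7): 0.7, (2, 1): 0.05, (20, 4): 0.01}
curve = cumulative_ablation_curve(scores, toy_evaluate, steps=[0, 1, 2, 4])
```

A curve that drops sharply over the first few heads and then flattens is what the narrow-pathway claim predicts; a flat curve on new data would count against it.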

Figures

Figures reproduced from arXiv: 2605.19792 by Gemma Roig, Martina G. Vilas, Timothy Schaumlöffel.

Figure 1: Alignment between predicted and scaled ground-truth…
Figure 2: Positional decoding results. Left: average position accuracy per layer for visual backbone (0-23), the multimodal projection…
Figure 3: Performance after attention knockout. We block atten…
Figure 4: Mediation Fraction (MF) scores for every attention head…
Figure 5: Localization accuracy under cumulative head ablation.
Figure 6: Dataset examples. Example images from the COCO validation set where exactly one object per image is removed (highlighted by its original bounding box) and the background is filled using an inpainting strategy [30]. This procedure allows us to filter the dataset for potential hallucinations: if the model can still detect the removed object purely from contextual cues, it undermines the validity of our groun…
Figure 7: Visualization of object mask for the ablation experiment. Left: the original 336×336 input image in pixel space with an annotated object mask. This mask is mapped onto the 24×24 token grid of the vision transformer, where a token is selected if it has any pixel overlap with the original mask. Right: examples of padding applied to the token mask. Negative padding removes adjacent tokens and shrinks the abla…
Figure 8: Object extension: additional results. Alignment between predicted and scaled ground-truth bounding boxes under object padding. Each cell shows the mean accuracy between predictions obtained with a given padding level and ground-truth boxes scaled by different amounts. Diagonal entries correspond to matching padding and scaling levels, indicating how well the predicted box size adapts to the artificially enl…
Figure 9: Examples of the object extension experiment. Each image shows the input with its mask in pixel space. The yellow region indicates the original mask, while the green region denotes the padding p added by sampling tokens from the object. The top row corresponds to p = 1 and the bottom row to p = 2. We display both the predicted and ground-truth bounding boxes for the original and the extended object. The pr…
Figure 10: Performance drop for classification and detection when ablating global, local, or both image views across object sizes. We…
Figure 11: Heatmap visualizations of positional decoding accuracy at selected stages of LLaVA. Each heatmap shows the probability of correctly predicting the position of a visual token in the 24 × 24 grid. The multimodal projection retains positional information mainly at the four corners, effectively marking the image boundaries needed to infer its dimensions. Accuracy then increases within the LLM, becoming highes…
Figure 12: Attention blocking across layers. We measure classification and localization accuracy when blocking attention from post-image tokens to object tokens, either in groups of six layers (left) or one layer at a time (right). Localization accuracy drops sharply in early–mid layers for LLaVA models and in mid–late layers for InternVL, while classification remains largely stable across the network. Blocking atte…
Figure 13: Causal mediation via activation patching. We compare three model runs: (1) the source run, where the object is present and the model produces the correct answer; (2) the base run, where the object is removed and the model fails; and (3) the patched run, where we transfer hidden activations from a selected attention head in the source run into the base run. Improvements in the patched prediction indicate t…
Figure 14: Causal Mediation Analysis for LLaVA-13B. Mediation Fraction scores for every attention head across all layers, shown separately for the detection and classification tasks.
Figure 15: Localization accuracy under cumulative head ablation. Attention heads are ranked by their mean MF and progressively removed.
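
The mask mapping described in the Figure 7 caption (a 336×336 pixel mask projected onto a 24×24 token grid, with a token selected on any pixel overlap, then padded) can be sketched as below. The 14-pixel patch size follows from 336 / 24. The padding rule is an assumption: the caption is truncated, so this sketch treats positive padding as growing the mask by one ring of 8-connected neighbours per step and negative padding as shrinking it by one ring. `mask_to_tokens` and `pad_tokens` are hypothetical names.

```python
PATCH = 14  # pixels per token side: 336 / 24
GRID = 24   # tokens per side

def mask_to_tokens(pixel_mask):
    """Select every token whose patch overlaps at least one masked pixel."""
    tokens = set()
    for y, row in enumerate(pixel_mask):
        for x, v in enumerate(row):
            if v:
                tokens.add((y // PATCH, x // PATCH))
    return tokens

def _neighbours(r, c):
    """In-grid 3x3 neighbourhood of a token, including the token itself."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < GRID and 0 <= cc < GRID:
                yield rr, cc

def pad_tokens(tokens, p):
    """p > 0 grows the mask by p rings; p < 0 shrinks it by |p| rings (assumed rule)."""
    tokens = set(tokens)
    for _ in range(abs(p)):
        if p > 0:
            tokens = {n for t in tokens for n in _neighbours(*t)}
        else:
            tokens = {t for t in tokens if all(n in tokens for n in _neighbours(*t))}
    return tokens

# Toy object covering pixel rows 20..50 and columns 30..60.
mask = [[0] * 336 for _ in range(336)]
for y in range(20, 51):
    for x in range(30, 61):
        mask[y][x] = 1
tokens = mask_to_tokens(mask)
```

With this toy mask the object lands on a 3×3 block of tokens; `pad_tokens(tokens, 1)` grows it to 5×5 and `pad_tokens(tokens, -1)` leaves only the centre token.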
Original abstract

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.
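
Attention knockout, as the Figure 12 caption describes it, blocks attention from post-image tokens to object tokens. One common way to express such a block is an additive attention mask with negative infinity at the forbidden (query, key) pairs. The sketch below builds that mask in plain Python; it is an assumption about the form of the intervention, not the authors' code, and the sequence layout and `knockout_mask` name are hypothetical.

```python
NEG_INF = float("-inf")

def knockout_mask(seq_len, query_positions, key_positions):
    """Additive mask: causal, with the listed (query, key) pairs also blocked.

    Entry [q][k] is 0.0 where query q may attend to key k, and -inf otherwise.
    """
    blocked_q, blocked_k = set(query_positions), set(key_positions)
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal_ok = k <= q
            knocked = q in blocked_q and k in blocked_k
            row.append(0.0 if causal_ok and not knocked else NEG_INF)
        mask.append(row)
    return mask

# Toy layout: positions 1..6 are image tokens, 3 and 4 lie on the object,
# and positions 7..9 are the post-image (text) tokens.
object_tokens = [3, 4]
post_image = [7, 8, 9]
mask = knockout_mask(10, post_image, object_tokens)
```

Adding this mask to the attention logits before the softmax in a chosen layer, or group of layers, removes the direct path from object tokens to the text positions while leaving image-to-image attention intact.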

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates mechanisms of object localization in VLMs (LLaVA-1.5 and InternVL-3.5) via token ablations, attention knockout, and causal mediation analysis. It claims localization follows a containerization mechanism in which object-aligned tokens set spatial extent while internal semantic arrangement is largely irrelevant to the predicted box; only a small set of attention heads (early-mid layers in LLaVA, mid-late in InternVL) mediates causal effects for both classification and localization, with shared early processing but distinct later heads.

Significance. If the interventional results hold, the work supplies the first layer- and head-level mechanistic account of localization in VLMs. The identification of narrow computational pathways and the containerization finding could directly inform grounding objectives and architecture choices. The reliance on causal interventions rather than purely correlational measures is a methodological strength.

major comments (2)
  1. [Results and Causal Mediation Analysis sections] The containerization claim and narrow-head conclusion rest on the assumption that token ablations and attention knockouts cleanly isolate contributions without compensatory interference from remaining heads or layers. The manuscript does not report controls for this (e.g., random-head knockouts or tests for performance recovery in unablated components), which is load-bearing for the central mechanism.
  2. [Token Ablation Experiments] The finding that semantic arrangement inside object-aligned tokens is irrelevant depends on the specific images and prompts tested. No robustness checks across prompt variations or diverse image datasets are described, leaving open the possibility that the 'irrelevant arrangement' result is sensitive to experimental choices.
minor comments (2)
  1. [Methods] Clarify the exact quantitative thresholds used to identify the 'very small set' of mediating heads (e.g., effect-size cutoffs in the mediation analysis).
  2. [Figures] Ensure all figures showing head concentrations include explicit layer and head indices for reproducibility.
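
To illustrate what the first minor comment is asking for, here is a sketch of a mediation-fraction score and an explicit selection threshold. The exact definition the paper uses is not reproduced on this page, so the normalization below (recovered fraction of the source/base gap, following the three runs in the Figure 13 caption) is an assumption, as are the names `mediation_fraction` and `select_heads`, the toy scores, and the 0.1 cutoff.

```python
def mediation_fraction(p_source, p_base, p_patched, eps=1e-8):
    """Share of the source-vs-base gap recovered by patching one head (assumed form)."""
    gap = p_source - p_base
    if abs(gap) < eps:
        return 0.0  # no gap to mediate
    return (p_patched - p_base) / gap

def select_heads(mf_by_head, threshold):
    """Heads whose score meets the cutoff, strongest first."""
    kept = [h for h, mf in mf_by_head.items() if mf >= threshold]
    return sorted(kept, key=lambda h: -mf_by_head[h])

# Hypothetical per-head scores keyed by (layer, head).
mf_by_head = {(14, 5): 0.42, (15, 2): 0.31, (3, 0): 0.01}
selected = select_heads(mf_by_head, threshold=0.1)
```

Reporting the cutoff alongside the selected (layer, head) pairs would make the "very small set" reproducible.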

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and outline revisions that strengthen the manuscript's methodological rigor and robustness.

Point-by-point responses
  1. Referee: [Results and Causal Mediation Analysis sections] The containerization claim and narrow-head conclusion rest on the assumption that token ablations and attention knockouts cleanly isolate contributions without compensatory interference from remaining heads or layers. The manuscript does not report controls for this (e.g., random-head knockouts or tests for performance recovery in unablated components), which is load-bearing for the central mechanism.

    Authors: We agree that explicit controls for compensatory interference are necessary to support the specificity of the identified heads and the containerization mechanism. While the causal mediation analysis already isolates effects at the head level, we recognize that the original submission lacked random ablation baselines. We have since run additional experiments ablating matched numbers of randomly selected heads across the same layers; these show substantially smaller performance drops in both localization and classification compared to ablating the specialized heads. We will add these control results, along with a discussion of their implications, to the Results and Causal Mediation Analysis sections in the revised manuscript.

    revision: yes

  2. Referee: [Token Ablation Experiments] The finding that semantic arrangement inside object-aligned tokens is irrelevant depends on the specific images and prompts tested. No robustness checks across prompt variations or diverse image datasets are described, leaving open the possibility that the 'irrelevant arrangement' result is sensitive to experimental choices.

    Authors: We concur that the irrelevance of internal token arrangement within object boundaries requires demonstration of robustness. The original experiments spanned multiple object categories and image sources, yet did not include systematic prompt paraphrasing or additional datasets. We will therefore incorporate new experiments that vary prompt wording (e.g., different phrasings of the localization query) and evaluate on a broader collection of images drawn from additional benchmarks. These results will be reported in an expanded Token Ablation Experiments section to confirm that the containerization finding is not an artifact of the original experimental choices.

    revision: yes
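
The random-head control mentioned in the first response (matched numbers of randomly selected heads across the same layers) could be sampled along these lines. This is a sketch under assumptions: the function name, the (layer, head) representation, and the choice to exclude the specialized heads from the random pool are hypothetical, not taken from the rebuttal.

```python
import random
from collections import Counter

def sample_matched_random_heads(target_heads, heads_per_layer, seed=0):
    """Draw, per layer, as many random non-target heads as there are target heads."""
    rng = random.Random(seed)
    target = set(target_heads)
    per_layer = Counter(layer for layer, _ in target)
    control = set()
    for layer, n in sorted(per_layer.items()):
        pool = [(layer, h) for h in range(heads_per_layer) if (layer, h) not in target]
        control.update(rng.sample(pool, n))
    return control

# Hypothetical specialized heads, keyed by (layer, head), in a 32-head model.
target = {(10, 3), (10, 7), (12, 1)}
control = sample_matched_random_heads(target, heads_per_layer=32, seed=0)
```

Repeating the draw over several seeds and comparing the accuracy drop against the drop from ablating `target` gives the baseline the referee asked for.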

Circularity Check

0 steps flagged

No circularity: empirical interventions produce independent mechanistic findings

full rationale

The paper reports results from token ablations, attention knockout, and causal mediation analysis applied to LLaVA-1.5 and InternVL-3.5. These interventions directly yield the containerization mechanism, narrow head involvement, and task-specific specialization as experimental outputs. No equations, fitted parameters, or self-citations are invoked to derive the central claims; the findings are not presupposed by the methods or reduced to inputs by construction. The work is self-contained against external benchmarks of interpretability tooling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of mechanistic interpretability methods and on the assumption that the observed effects generalize beyond the two model families and the specific evaluation setup.

axioms (1)
  • Domain assumption: token ablations, attention knockout, and causal mediation analysis accurately identify causal contributions of individual heads and tokens in transformer-based VLMs.
    The entire mechanistic story rests on these tools producing faithful causal attributions.



Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.

  2. [2]

    Samyadeep Basu, Martin Grayson, Cecily Morrison, et al. Understanding information storage and transfer in multimodal large language models. Advances in Neural Information Processing Systems, 37:7400–7426, 2024.

  3. [3]

    Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localization properties in vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3828–3837, 2024.

  4. [4]

    Declan Iain Campbell, Sunayana Rane, and Tyler Giallanza. Understanding the limits of vision language models through the lens of the binding problem. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

  5. [5]

    Shiqi Chen, Tongyao Zhu, and Ruochen Zhou. Why is spatial reasoning hard for VLMs? An attention mechanism perspective on focus areas. In Forty-second International Conference on Machine Learning, 2025.

  6. [6]

    Zhe Chen, Jiannan Wu, Wenhai Wang, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

  7. [7]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,

  8. [8]

    Wenliang Dai, Junnan Li, Dongxu Li, et al. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.

  9. [9]

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024.

  10. [10]

    Matt Deitke, Christopher Clark, Sangho Lee, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104,

  11. [11]

    Jia Deng, Wei Dong, Richard Socher, et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

  12. [12]

    Mark Everingham, Luc Van Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, 2010.

  13. [13]

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023.

  14. [14]

    Dongsheng Jiang, Yuchen Liu, Songlin Liu, et al. From CLIP to DINO: Visual encoders shout in multi-modal large language models. arXiv preprint arXiv:2310.08825, 2023.

  15. [15]

    Nicholas Jiang, Anish Kachinthaya, Suzanne Petryk, and Yossi Gandelsman. Interpreting and editing vision-language representations to mitigate hallucinations. In The Thirteenth International Conference on Learning Representations, 2025.

  16. [16]

    Omri Kaduri, Shai Bagon, and Tali Dekel. What's in the image? A deep-dive into the vision of vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14549–14558, 2025.

  17. [17]

    Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, and Xiachong Feng. Causal tracing of object representations in large vision language models: Mechanistic interpretability and hallucination mitigation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 31645–31653, 2026.

  18. [18]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  19. [19]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  20. [20]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022. Curran Associates Inc.

  21. [21]

    Clement Neo, Luke Ong, Philip Torr, et al. Towards interpreting visual information processing in vision-language models. In The Thirteenth International Conference on Learning Representations, 2025.

  22. [22]

    Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. Towards vision-language mechanistic interpretability: A causal tracing tool for BLIP. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2856–2861, 2023.

  23. [23]

    Georgios Pantazopoulos and Eda B. Özyiğit. Towards understanding visual grounding in visual language models. arXiv preprint arXiv:2509.10345, 2025.

  24. [24]

    Zhiliang Peng, Wenhui Wang, Li Dong, et al. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024.

  25. [25]

    Alec Radford, Jong Wook Kim, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  26. [26]

    Sunayana Rane, Alexander Ku, and Jason Michael Baldridge. Can generative multimodal models count to ten? In ICLR 2024 Workshop on Representational Alignment, 2024.

  27. [27]

    Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of CLIP for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision, pages 139–156. Springer, 2024.

  28. [28]

    Shweta Singh, Aayan Yadav, Jitesh Jain, et al. Benchmarking object detectors with COCO: A new path forward. 2024.

  29. [29]

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.

  30. [30]

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, et al. Resolution-robust large mask inpainting with Fourier convolutions. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3172–3182,

  31. [31]

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, et al. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

  32. [32]

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023.

  33. [33]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265,

  34. [34]

    An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025.

  35. [35]

    Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, and Taylor Webb. Emergent symbolic mechanisms support abstract reasoning in large language models. Proceedings of Machine Learning Research, 267:70515–70549, 2025.

  36. [36]

    Zhuoran Yu and Yong Jae Lee. How multimodal LLMs solve image tasks: A lens on visual grounding, task reasoning, and answer decoding. In Second Conference on Language Modeling, 2025.

  37. [37]

    Hao Zhang, Hongyang Li, Feng Li, et al. LLaVA-Grounding: Grounded visual chat with large multimodal models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

  38. [38]

    Yuhui Zhang, Alyssa Unell, Xiaohan Wang, et al. Why are visually-grounded language models bad at image classification? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  39. [39]

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, et al. RegionCLIP: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022.

Mechanisms of Object Localization in Vision–Language Models: Supplementary Material

  40. [40]

    Dataset. We provide additional details on dataset construction, task prompting, and representative examples. 6.1. Dataset Filtering Details. We evaluate on the COCO validation split [18] with label corrections from [28], and apply the following filtering steps to improve annotation quality and satisfy the requirements of our experimental setup. 1. Object siz...

  41. [41]

    Ablation Study. 7.1. Visualization of Masking. [Figure 7 panels: Image with ground-truth Mask; Resized Mask; Mask with Padding. Padding types: -2, -1, 0, +1, +2.] Figure 7. Visualization of object mask for the ablation experiment. Left: the origi...

  42. [42]

    Positional Information. [Figure 11 panels: Projection (LLaVA-7B/13B); Layer 13 (LLaVA-7B); Layer 12 (LLaVA-13B); Projection (InternVL3.5-ViT); Layer 7 (InternVL3.5); Layer 26 (InternVL3.5). Axes: Position. Scale: Accuracy, 0.0 to 1.0.] Figure 11. Hea...

  43. [43]

    Attention Blocking. [Figure 12 panels: LLaVA-7b | six layer-wise; LLaVA-7b | single layer-wise; LLaVA-13b | six layer-wise; LLaVA-13b | si... Axes: Layer, Accuracy.]

  44. [44]

    Causal Mediation Analysis. 10.1. Visualization of CMA Method. [Figure 13 diagram: ViT and projection feeding the prompt "Is there a motorcycle in the image?" with Yes/No outputs.]

  45. [45]

    [Figure 13 diagram labels: Source Run; Patched Run; Activation Transfer.]

  46. [46]

    [Figure 13 diagram labels: Base Run; Logits; Transformer Layer; Attention Head; Forward Pass; Patching.] Figure 13. Causal mediation via activation patching. We compare three model runs: (1) the source run, where the object is present and the model produces the correct answer; (2) the base run, where the object is removed and the model fails; and (3) the ...