Listening makes Vision Clear for VLMs

Binrui Shen; Yixin Tan; Yiyang Chen

arxiv: 2606.23763 · v1 · pith:6ZP3TVDWnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 09:07 UTC · model grok-4.3

keywords: vision-language models · attention maps · token activation · localization metrics · prompt-side evaluation · multimodal consistency · decoding drift

The pith

Prompt-side token attention with boundary filtering measures vision-language alignment in VLMs more accurately than answer-side attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention taken from answer tokens in vision-language models often gets pulled off target by accumulated language priors and by structural tokens like modality markers that blanket the whole input. It replaces that approach with attention drawn from the prompt tokens themselves, cleaned of those markers, and scored by the peak of the attention distribution rather than simple overlap masks. The result is a Prompt-Vision Token Activation Map that produces higher scores on both attention-based and IoU-style localization tests across multiple datasets. A reader would care because current ways of checking whether a model is actually looking at the right image region when it answers are noisy, and cleaner checks matter for trusting or improving large VLMs.

Core claim

Answer-side attention distributions in VLMs suffer from decoding drift caused by previously generated tokens and from high attention on irrelevant regions induced by modality boundary markers. Prompt-Vision Token Activation Map extracts attention from prompt-side semantics, applies a filter to remove the boundary-marker bias, and evaluates alignment by the peak distribution of that attention rather than mask overlap alone, yielding consistently higher localization metrics than answer-side baselines.

What carries the argument

Prompt-Vision Token Activation Map (PV-TAM), which pulls attention weights from prompt tokens, filters modality boundary markers, and scores alignment via peak attention distribution between prompt semantics and visual regions.
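
The prompt-side extraction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how the boundary-marker filter is defined, so the sketch assumes the simplest form, subtracting the marker's own attention over the vision tokens and clipping at zero. The function name `prompt_vision_map`, its arguments, and the toy attention matrix are all hypothetical.

```python
def prompt_vision_map(attn, prompt_idx, vision_idx, marker_idx):
    """Return a normalized attention map over vision tokens for one prompt token.

    attn        -- square list of lists; attn[q][k] is the weight query q puts on key k
    prompt_idx  -- position of the prompt token used as the query
    vision_idx  -- positions of the vision tokens (keys)
    marker_idx  -- position of a modality boundary marker (assumed source of bias)
    """
    raw = [attn[prompt_idx][k] for k in vision_idx]
    bias = [attn[marker_idx][k] for k in vision_idx]
    # Assumed filter: subtract the marker's attention pattern and keep only the excess.
    filtered = [max(r - b, 0.0) for r, b in zip(raw, bias)]
    total = sum(filtered)
    if total == 0.0:
        return [0.0] * len(filtered)
    return [f / total for f in filtered]


# Toy causal sequence: positions 0-3 are vision tokens, 4 is the boundary
# marker, 5 is a prompt token. Values are hand-picked for illustration.
attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25, 0.0, 0.0],
    [0.4, 0.1, 0.1, 0.1, 0.3, 0.0],    # marker: heavy weight on vision token 0
    [0.4, 0.1, 0.3, 0.1, 0.05, 0.05],  # prompt token: raw peak is also token 0
]

pv_map = prompt_vision_map(attn, prompt_idx=5, vision_idx=[0, 1, 2, 3], marker_idx=4)
peak = max(range(len(pv_map)), key=pv_map.__getitem__)
```

In this toy case the raw prompt-token attention peaks on vision token 0, which the marker also favors; after the assumed filter the peak moves to vision token 2. A real model would supply `attn` from its last-layer attention weights, typically averaged over heads.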

If this is right

PV-TAM raises both attention-based and IoU-style localization scores compared with answer-side baselines.
The improvement holds across multiple vision-language datasets.
The method supplies a consistency evaluation that avoids the accumulation of language priors during answer generation.
Metrics that use peak attention distribution capture intensity of alignment better than overlap masks alone.
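
The difference between overlap-only and peak-based scoring can be made concrete with a small sketch. It is illustrative only: the paper's metric definitions are not given in this summary, so the sketch uses stand-ins (a thresholded IoU, a pointing-style peak hit, and the attention mass inside the mask). The function names, the threshold, and the toy maps are assumptions.

```python
def iou_score(attn_map, mask, threshold):
    """IoU between the thresholded attention map and a binary ground-truth mask."""
    pred = [a >= threshold for a in attn_map]
    inter = sum(1 for p, m in zip(pred, mask) if p and m)
    union = sum(1 for p, m in zip(pred, mask) if p or m)
    return inter / union if union else 0.0


def peak_hit(attn_map, mask):
    """Return 1 if the single highest-attention cell lies inside the mask, else 0."""
    peak = max(range(len(attn_map)), key=attn_map.__getitem__)
    return 1 if mask[peak] else 0


def mass_in_mask(attn_map, mask):
    """Fraction of the total attention that falls inside the mask."""
    total = sum(attn_map)
    inside = sum(a for a, m in zip(attn_map, mask) if m)
    return inside / total if total else 0.0


# Hand-picked toy maps over four image cells; the target region is cells 0 and 1.
mask = [1, 1, 0, 0]
map_a = [0.30, 0.30, 0.35, 0.05]  # peak sits just outside the target
map_b = [0.35, 0.30, 0.30, 0.05]  # peak sits inside the target

same_iou = iou_score(map_a, mask, 0.25) == iou_score(map_b, mask, 0.25)
hits = (peak_hit(map_a, mask), peak_hit(map_b, mask))
```

Both toy maps binarize to the same region at this threshold, so the IoU cannot tell them apart, while the peak-based checks can. That is the sense in which a mask-only score ignores activation intensity.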

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt-side filtering idea could be applied to other multimodal models that generate long outputs to reduce drift in their internal attention.
PV-TAM scores might serve as an auxiliary loss during fine-tuning to encourage tighter prompt-visual coupling.
Developers debugging VLM failures on visual questions could inspect the filtered prompt attention maps to locate where the model loses the intended region.

Load-bearing premise

Prompt-side attention after boundary-marker filtering truly sidesteps the distortions introduced by decoding drift and structural tokens.

What would settle it

A controlled test set with known ground-truth visual regions where answer-side attention metrics match human judgments of correct alignment more closely than PV-TAM metrics.
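
One way such a comparison could be scored is by rank correlation between each metric and the human ratings. The sketch below is a hypothetical evaluation harness, not something the paper describes: it implements Spearman correlation with average ranks for ties, and every number in the example is a placeholder chosen by hand, not a measurement.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # convention used here: undefined correlation reported as 0
    return cov / (vx * vy) ** 0.5


# Placeholder values, not measurements: human alignment ratings for five
# examples and the scores two hypothetical metrics assign to the same examples.
human = [1, 2, 3, 4, 5]
metric_one = [0.2, 0.1, 0.4, 0.3, 0.5]
metric_two = [0.1, 0.2, 0.3, 0.5, 0.4]

rho_one = spearman(human, metric_one)
rho_two = spearman(human, metric_two)
```

On a real test set, whichever metric correlates more strongly with the human ratings would be the better proxy for alignment, and a consistent advantage for the answer-side metric would count against PV-TAM.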

Figures

Figures reproduced from arXiv: 2606.23763 by Binrui Shen, Yixin Tan, Yiyang Chen.

**Figure 1.** A representative example showing the sensitivity of TAM to preceding context [22]. Redder regions indicate higher attention, but the highest-attention region does not coincide with the intended semantic area, the neck. The input text is a sentence in which the preceding words hurt vision-language alignment.

**Figure 2.** PV-TAM framework. It builds on a decoder-only backbone to perform alignment. The raw attention map is taken from the last-layer attention weights. Three components refine the attention map: (i) prompt-token guided alignment, using the prompt token as the query to attend to vision tokens and reduce semantic interference; (ii) a denoising filter to remove systematic bias introduced by structural to…

**Figure 3.** Visual comparison between PV-TAM and baselines. Red highlights the target token. The high attention indicated by the red areas is expected to fall in the semantic region corresponding to that token; the more precise, the better.

Original abstract

Recent work typically assesses vision-language consistency using attention distributions of answer-side tokens. However, we observe that highest attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where language priors from previously generated answer tokens accumulate and mismatch with visual attention. Besides the priors from previous answer tokens, we find that structural tokens, e.g., modality boundary markers, may encompass the entire context and generate high attention to areas unrelated to the target. To avoid these distortions and provide consistency evaluation for large VLMs, we adopt prompt-side semantics and propose Prompt-Vision Token Activation Map (PV-TAM). PV-TAM further incorporates a filter to remove systematic bias induced by modality boundary markers. Unlike traditional methods that evaluate overlap solely through masks while ignoring activation intensity, our metrics leverage the peak distribution of attention to measure the alignment between prompts and visual regions. In experiments, PV-TAM consistently improves both attention-based and IoU-style localization metrics over answer-side baselines on various datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PV-TAM shifts attention measurement to prompt tokens with a boundary filter and peak metrics, but the accuracy claim rests on untested assumptions.

PV-TAM moves the attention analysis to prompt tokens in VLMs, adds a filter to strip modality boundary markers, and replaces simple mask overlap with peak-distribution metrics. That combination is the concrete addition.

The motivation is straightforward: answer-side attention can drift because of accumulated language priors from generated tokens and because structural markers pull attention across the whole context. Focusing on the prompt side avoids the first problem by construction, and the filter targets the second. The paper reports that this produces higher scores on both attention-based and IoU-style localization metrics than answer-side baselines across the datasets they tested.

The soft spot is the missing link between higher scores and actual faithfulness. The results show a numerical difference, but nothing in the description confirms that the new maps line up better with the intended visual regions than the old ones. No human judgment data, no controlled comparison against ground-truth masks, and no downstream-task correlation are mentioned. Without that, the improvement could reflect a different bias rather than reduced distortion.

The work engages the existing attention literature without circularity or overclaim. It is aimed at people who already use attention maps to diagnose vision-language models. A reader who needs an alternative measurement procedure might find the construction worth trying.

It should go to peer review so the experimental details, baselines, and any code can be checked directly.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Prompt-Vision Token Activation Map (PV-TAM) as a method to assess vision-language consistency in VLMs. It argues that answer-side token attention is distorted by decoding drift from language priors and by structural tokens such as modality boundary markers. PV-TAM instead uses filtered prompt-side token attention and defines new metrics based on peak attention distribution rather than mask overlap alone. Experiments are said to show consistent gains on both attention-based and IoU-style localization metrics over answer-side baselines across various datasets.

Significance. If the central premise holds, the work would supply a measurement procedure for VLM consistency that is free of fitted parameters and sidesteps answer-generation artifacts. This could strengthen evaluation protocols in the field. The manuscript positions the approach as an alternative procedure rather than a learned model, which is a positive feature.

major comments (2)

[Abstract] The claim that PV-TAM supplies a more accurate measure of vision-language consistency (rather than merely a numerically different one) rests on the premise that prompt-side attention after filtering aligns better with intended semantics; no human judgment study, downstream-task correlation, or controlled comparison against ground-truth localization masks is described to test this premise.
[Abstract] The reported 'consistent improvements' on attention-based and IoU-style metrics are asserted without any description of the datasets, baseline definitions, statistical tests, variance estimates, or error analysis, so the robustness of the central empirical claim cannot be assessed from the supplied evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We respond to each major comment below.

Referee: [Abstract] The claim that PV-TAM supplies a more accurate measure of vision-language consistency (rather than merely a numerically different one) rests on the premise that prompt-side attention after filtering aligns better with intended semantics; no human judgment study, downstream-task correlation, or controlled comparison against ground-truth localization masks is described to test this premise.

Authors: The IoU-style metrics provide a controlled comparison against localization information, but we acknowledge that the manuscript does not include a human judgment study or downstream-task correlation to further validate the premise of improved semantic alignment. The argument for PV-TAM rests on the identified distortions in answer-side attention and the filtering mechanism. We will revise the abstract to moderate the phrasing around 'more accurate' and to clarify the role of the IoU-style evaluation. Revision: partial.

Referee: [Abstract] The reported 'consistent improvements' on attention-based and IoU-style metrics are asserted without any description of the datasets, baseline definitions, statistical tests, variance estimates, or error analysis, so the robustness of the central empirical claim cannot be assessed from the supplied evidence.

Authors: The abstract is a concise summary; the full manuscript describes the datasets, baselines, and reports the metric results. We agree that adding statistical tests, variance estimates, and error analysis would strengthen the presentation of robustness. We will incorporate these elements in the revised manuscript and update the abstract to reference them. Revision: yes.

Circularity Check

0 steps flagged

No circularity; PV-TAM is an independent measurement procedure

The paper proposes PV-TAM as a new prompt-side attention method with a modality-boundary filter to measure vision-language consistency, contrasting it with answer-side baselines. No equations, derivations, or self-citations are shown that reduce the proposed metrics or claims to quantities defined by fitted parameters from the same data, self-referential definitions, or load-bearing prior work by the authors. The method is presented as an alternative procedure whose value is evaluated empirically on datasets, without any reduction of outputs to inputs by construction. This is the common case of a self-contained methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no explicit free parameters, mathematical axioms, or newly postulated entities; the contribution is a measurement procedure rather than a derived model.

pith-pipeline@v0.9.1-grok · 5697 in / 984 out tokens · 18793 ms · 2026-06-26T09:07:50.776285+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 4 canonical work pages

[1] Abnar, S., Zuidema, W.: Quantifying attention flow in transformers. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 4190–4197. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.385
[2] Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018)
[3] Ali, A., Schnake, T., Eberle, O., Montavon, G., Müller, K.R., Wolf, L.: Xai for transformers: Better explanations through conservative propagation. In: International conference on machine learning. pp. 435–451. PMLR (2022)
[4] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
[5] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems 28 (2015)
[6] Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 782–791 (June 2021)
[7] Chen, J., Wei, F., Zhao, J., Song, S., Wu, B., Peng, Z., Chan, S.H.G., Zhang, H.: Revisiting referring expression comprehension evaluation in the era of large multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 513–524 (2025)
[8] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. p. 104–120. Springer-Verlag, Berlin, Heidelberg (2020). https://doi.org/10.1007/978-3-030-58577-8_7
[9] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)
[10] Fong, R., Patrick, M., Vedaldi, A.: Understanding deep networks via extremal perturbations and smooth masks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
[11] Gatis, D.: rembg: Remove image backgrounds. https://github.com/danielgatis/rembg (2022)
[12] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
[13] He, J., Yang, S., Yang, S., Kortylewski, A., Yuan, X., Chen, J.N., Liu, S., Yang, C., Yu, Q., Yuille, A.: Partimagenet: A large, high-quality dataset of parts. In: European Conference on Computer Vision. pp. 128–145. Springer (2022)
[14] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)
[15] Jiang, P.T., Zhang, C.B., Hou, Q., Cheng, M.M., Wei, Y.: Layercam: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing (2021)
[16] Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: Mdetr - modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1780–1790 (October 2021)
[17] Kang, S., Kim, J., Kim, J., Hwang, S.J.: Your large vision-language model only needs a few attention heads for visual grounding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9339–9350 (2025)
[18] Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)
[19] Li, C., Liu, H., Li, L., Zhang, P., Aneja, J., Yang, J., Jin, P., Hu, H., Liu, Z., Lee, Y.J., et al.: Elevater: A benchmark and toolkit for evaluating language-augmented visual models. Advances in Neural Information Processing Systems 35, 9287–9301 (2022)
[20] Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., Chang, K.W., Gao, J.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10965–10975 (June 2022)
[21] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Computer Vision – ECCV 2020. p. 121–137. Springer International Publishing, Berlin, Heidelberg (2020). https://doi.org/10.1007/978-3-030-58577-8_8
[22] Li, Y., Wang, H., Ding, X., Wang, H., Li, X.: Token activation map to visually explain multimodal llms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 48–58 (October 2025)
[23] Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11, 285–296 (1975)
[24] Petsiuk, V., Jain, R., Manjunatha, V., Morariu, V.I., Mehra, A., Ordonez, V., Saenko, K.: Black-box explanation of object detectors via saliency maps. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11443–11452 (2021)
[25] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2015)
[26] Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition 106, 107404 (2020)
[27] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 4035–4045. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). http... doi:10.18653/v1/d18-1437
[28] Schmidt, F.: Generalization in generation: A closer look at exposure bias. In: Proceedings of the 3rd Workshop on Neural Generation and Translation. pp. 157–167 (2019)
[29] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
[30] Selvaraju, R.R., Lee, S., Shen, Y., Jin, H., Ghosh, S., Heck, L., Batra, D., Parikh, D.: Taking a hint: Leveraging explanations to make vision and language models more grounded. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2591–2600 (2019)
[31] Serrano, S., Smith, N.A.: Is attention interpretable? In: Proceedings of the 57th annual meeting of the association for computational linguistics. pp. 2931–2951 (2019)
[32] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association for Computational Linguistics, Melb... doi:10.18653/v1/p18-1238
[33] Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)
[34] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[35] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
[36] Wiegreffe, S., Pinter, Y.: Attention is not not explanation. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 11–20. Association for Computational Linguistics, Hong Kong, Ch... doi:10.18653/v1/d19-1002
[37] Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 69–85. Springer International Publishing, Cham (2016)
[38] Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 35, 36067–36080 (2022)
[39] Zhang, J., Liu, F., Du, C., Pang, T.: Adavboost: Mitigating hallucinations in lvlms via token-level adaptive visual attention boosting. arXiv preprint arXiv:2602.13600 (2026)
[40] Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: Vinvl: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5579–5588 (June 2021)
[41] Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., Gao, J.: Regionclip: Region-based language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16793–16803 (June 2022)
[42] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2921–2929 (2016)
[43] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ...

[1] [1]

In: Juraf- sky, D., Chai, J., Schluter, N., Tetreault, J

Abnar, S., Zuidema, W.: Quantifying attention flow in transformers. In: Juraf- sky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 4190–4197. Asso- ciation for Computational Linguistics, Online (Jul 2020).https://doi.org/10. 18653/v1/2020.acl-main.385

2020

[2] [2]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018)

2018

[3] [3]

In: Interna- tional conference on machine learning

Ali, A., Schnake, T., Eberle, O., Montavon, G., M¨ uller, K.R., Wolf, L.: Xai for transformers: Better explanations through conservative propagation. In: Interna- tional conference on machine learning. pp. 435–451. PMLR (2022)

2022

[4] [4]

5-vl technical report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

Pith/arXiv arXiv 2025

[5] [5]

Advances in neural information pro- cessing systems28(2015)

Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information pro- cessing systems28(2015)

2015

[6] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visu- alization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 782–791 (June 2021)

2021

[7] [7]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chen, J., Wei, F., Zhao, J., Song, S., Wu, B., Peng, Z., Chan, S.H.G., Zhang, H.: Revisiting referring expression comprehension evaluation in the era of large multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 513–524 (2025)

2025

[8] [8]

In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX

Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. p. 104–120. Springer-Verlag, Berlin, Heidelberg (2020).https://doi. org/10.1007/978-3-030-58577-8_7

work page doi:10.1007/978-3-030-58577-8_7 2020

[9] [9]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

2024

[10] [10]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

Fong, R., Patrick, M., Vedaldi, A.: Understanding deep networks via extremal perturbations and smooth masks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

2019

[11] [11]

Gatis, D.: rembg: Remove image backgrounds.https://github.com/ danielgatis/rembg(2022)

2022

[12] [12]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017) 16 Chen

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017) 16 Chen. et al

2017

[13] [13]

In: European Conference on Computer Vision

He, J., Yang, S., Yang, S., Kortylewski, A., Yuan, X., Chen, J.N., Liu, S., Yang, C., Yu, Q., Yuille, A.: Partimagenet: A large, high-quality dataset of parts. In: European Conference on Computer Vision. pp. 128–145. Springer (2022)

2022

[14]

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

[15]

Jiang, P.T., Zhang, C.B., Hou, Q., Cheng, M.M., Wei, Y.: Layercam: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing (2021)

[16]

Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: Mdetr - modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1780–1790 (October 2021)

[17]

Kang, S., Kim, J., Kim, J., Hwang, S.J.: Your large vision-language model only needs a few attention heads for visual grounding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9339–9350 (2025)

[18]

Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)

[19]

Li, C., Liu, H., Li, L., Zhang, P., Aneja, J., Yang, J., Jin, P., Hu, H., Liu, Z., Lee, Y.J., et al.: Elevater: A benchmark and toolkit for evaluating language-augmented visual models. Advances in Neural Information Processing Systems 35, 9287–9301 (2022)

[20]

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., Chang, K.W., Gao, J.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10965–10975 (June 2022)

[21]

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Computer Vision – ECCV 2020. pp. 121–137. Springer International Publishing, Berlin, Heidelberg (2020). https://doi.org/10.1007/978-3-030-58577-8_8

[22]

Li, Y., Wang, H., Ding, X., Wang, H., Li, X.: Token activation map to visually explain multimodal llms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 48–58 (October 2025)

[23]

Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11, 285–296 (1975)

[24]

Petsiuk, V., Jain, R., Manjunatha, V., Morariu, V.I., Mehra, A., Ordonez, V., Saenko, K.: Black-box explanation of object detectors via saliency maps. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11443–11452 (2021)

[25]

Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2015)

[26]

Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition 106, 107404 (2020)

[27]

Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 4035–4045. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). doi:10.18653/v1/d18-1437

[28]

Schmidt, F.: Generalization in generation: A closer look at exposure bias. In: Proceedings of the 3rd Workshop on Neural Generation and Translation. pp. 157–167 (2019)

[29]

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)

[30]

Selvaraju, R.R., Lee, S., Shen, Y., Jin, H., Ghosh, S., Heck, L., Batra, D., Parikh, D.: Taking a hint: Leveraging explanations to make vision and language models more grounded. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2591–2600 (2019)

[31]

Serrano, S., Smith, N.A.: Is attention interpretable? In: Proceedings of the 57th annual meeting of the association for computational linguistics. pp. 2931–2951 (2019)

[32]

Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association for Computational Linguistics, Melb... doi:10.18653/v1/p18-1238 (2018)

[33]

Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)

[34]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

[35]

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

[36]

Wiegreffe, S., Pinter, Y.: Attention is not not explanation. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 11–20. Association for Computational Linguistics, Hong Kong, Ch... doi:10.18653/v1/d19-1002

[37]

Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 69–85. Springer International Publishing, Cham (2016)

[38]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 35, 36067–36080 (2022)

[39]

Zhang, J., Liu, F., Du, C., Pang, T.: Adavboost: Mitigating hallucinations in lvlms via token-level adaptive visual attention boosting. arXiv preprint arXiv:2602.13600 (2026)

[40]

Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: Vinvl: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5579–5588 (June 2021)

[41]

Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., Gao, J.: Regionclip: Region-based language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16793–16803 (June 2022)

[42]

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2921–2929 (2016)

[43]

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ... arXiv (2025)