Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

Donald E. Brown; Jiebei Liu; Nazanin Moradinasab; Saurav Sengupta

arxiv: 2511.17722 · v3 · submitted 2025-11-21 · 💻 cs.CV

Pith reviewed 2026-05-17 20:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · object counting · synthetic benchmark · attention reweighting · cross-modal binding · enumeration · visual attention · diagnostic framework

The pith

Vision-language models count less accurately as visual and linguistic complexity rises, though targeted attention reweighting in the language decoder can strengthen grounding of quantity concepts to visual features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a synthetic benchmark that systematically varies object counts, colors, textures, backgrounds, and prompt specificity to measure how vision-language models perform on enumeration tasks. Results show steady declines in accuracy as these factors increase, patterns that parallel human cognitive load limits during counting. The authors then test interventions that reweight attention to visual tokens inside the language decoder at different layers and record modest gains in correct responses. This approach creates a controlled diagnostic for exposing cross-modal binding failures that standard natural-image tests often leave hidden.

Core claim

Using controlled synthetic images and prompts, the authors establish that VLM counting accuracy degrades systematically with rising numbers of objects, variations in color and texture, and greater prompt specificity. Exploratory attention reweighting in the language model decoder yields modest but measurable improvements by influencing how models connect linguistic quantity concepts to visual representations.

What carries the argument

Synthetic benchmark with perturbations across object number, color, texture, background, and prompt specificity, combined with attention reweighting operations applied to visual tokens in the language decoder layers.
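As a rough illustration of how such a perturbation design can be enumerated, the sketch below builds a full factorial grid over a few factors. The factor levels, the `build_conditions` helper, and the choice of a full factorial layout are assumptions made for illustration; the paper may vary one factor at a time and use different levels.

```python
from itertools import product

# Illustrative factor levels only; these are assumptions, not the benchmark's settings.
OBJECT_COUNTS = [3, 5, 10, 20]
OBJECT_COLORS = ["black", "white", "red", "yellow", "blue", "light gray", "green", "multicolor"]
BACKGROUNDS = ["plain", "checkerboard", "dots"]
PROMPT_IDS = ["P1", "P2"]


def build_conditions(counts, colors, backgrounds, prompts):
    """Return one dict per cell of the full factorial design (hypothetical helper)."""
    return [
        {"n_objects": n, "object_color": c, "background": b, "prompt_id": p}
        for n, c, b, p in product(counts, colors, backgrounds, prompts)
    ]


conditions = build_conditions(OBJECT_COUNTS, OBJECT_COLORS, BACKGROUNDS, PROMPT_IDS)
print(len(conditions))   # number of image/prompt conditions in the grid
print(conditions[0])     # first condition in the enumeration
```

Keeping every condition as a plain record makes it straightforward to group accuracy by any single factor afterwards.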

If this is right

Accuracy falls as the number of objects or visual variations such as color and texture increase.
More specific linguistic prompts produce corresponding drops in correct enumeration.
Reweighting attention to visual tokens at selected decoder layers shifts counting outputs in measurable ways.
Many errors trace to cross-modal binding rather than isolated visual processing deficits.
The controlled perturbations expose failure modes that natural-image benchmarks do not isolate easily.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar layer-specific reweighting could be tested on other visual reasoning tasks such as spatial relations or attribute matching.
The patterns suggest training-data biases contribute to quantity errors and may be addressable through decoder adjustments rather than full retraining.
Extending the benchmark to overlapping objects or dynamic scenes would test whether the same degradation and intervention effects persist.
The framework offers a way to compare model enumeration limits directly against human cognitive-load studies using matched stimuli.

Load-bearing premise

The synthetic perturbations and attention reweighting operations isolate the intended cross-modal binding failures without introducing new artifacts that would not appear in natural images.

What would settle it

The central claims would be falsified if, on the same benchmark, counting accuracy showed no consistent decline as object numbers or prompt specificity increase, or if the attention reweighting produced no consistent change in counting accuracy.

Figures

Figures reproduced from arXiv: 2511.17722 by Donald E. Brown, Jiebei Liu, Nazanin Moradinasab, Saurav Sengupta.

**Figure 1.** Token-level attention heatmaps for the compositional prompt 5 of object texture task. The high load from texture and color

**Figure 2.** Visualization of the model’s attention

**Figure 3.** Visualization of the model’s attention across different background-texture patterns for images containing fewer than 10 objects

**Figure 4.** Example images for the Object category, Color pattern, showing different object colors.

**Figure 5.** Example images for the Object category, Texture pattern, showing various texture types. (a) black (b) red (c) yellow (d) blue (e) light gray (f) green

**Figure 6.** Example images for the Background category, Color pattern, showing different background colors. (a) Checkerboard (b) Concentric Rings (c) Crosshatch (d) Diagonal Stripes (e) Dots (f) Horizontal Stripes (g) Linear Gradient (h) Radial Gradient (i) Vertical Stripes (j) Zigzag (k) Bubbles (l) Noise

**Figure 9.** Visualization of the model’s attention for the Qwen2.5-32B-Instruct and InternVL3-9B models

**Figure 10.** Visualization of the model’s attention across different background-texture patterns for images containing fewer than 10 objects

Original abstract

Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require selective visual attention, a demand that mirrors cognitive challenges observed in human enumeration tasks. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically characterize how counting performance varies as image and prompt properties change. Using open-source VLMs, we analyze how performance shifts across controlled perturbations (e.g. number of objects, object color, background color, object texture, background texture, and prompt specificity) and examine corresponding changes in visual attention allocation. We further conduct exploratory attention reweighting experiments in the language model decoder to modulate focus on visual tokens at different layers and assess their effects on counting behavior. Our results reveal that counting accuracy degrades systematically with increasing visual and linguistic complexity echoing human limits and cognitive load effects known from human perception, while targeted attention reweighting yields modest but measurable improvements. Rather than competing on benchmark accuracy, we introduce a controlled diagnostic framework for analyzing VLM enumeration behavior. Through systematic experiments, we expose failure modes rooted in cross-modal binding that natural image benchmarks may not easily isolate, and provide preliminary empirical evidence that targeted attention reweighting in the language decoder can influence how models ground linguistic quantity concepts in visual representations. Code and data available here: https://github.com/ssen7/vlm-count-analysis

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper supplies a reusable synthetic benchmark for probing VLM counting failures under controlled visual and prompt changes, plus preliminary decoder attention tweaks that still need tighter validation.

The core contribution is a new synthetic dataset and evaluation setup that varies object count, colors, textures, backgrounds, and prompt specificity in a systematic way. It shows counting accuracy dropping as visual and linguistic complexity rises, which matches known human enumeration limits, and reports modest gains from reweighting attention toward visual tokens in the language decoder. That controlled diagnostic framing and the perturbation framework are the genuinely new pieces; prior VLM counting work has not combined these exact factors with attention analysis in this way. The empirical trends are clear enough from the abstract to make the benchmark worth sharing.

The main soft spot is the attention reweighting section. It is labeled exploratory and lacks the ablations that would separate specific effects on quantity grounding from generic changes in token salience or decoder behavior, such as uniform scaling or random shifts at similar magnitudes. The abstract also omits error bars, statistical tests, and exact effect sizes, so the intervention claims rest on thinner ground than the degradation patterns.

This work is aimed at researchers building or debugging visual reasoning systems who need a clean testbed for enumeration rather than a new state-of-the-art score. A reader interested in cross-modal binding or diagnostic tools for VLMs will get practical value from the benchmark and code release.

The paper is coherent on its own terms and shows honest engagement with the limits of current models, so it deserves a serious referee. I would send it for review but ask for the missing controls on the reweighting experiments and fuller statistical reporting before acceptance.

Referee Report

1 major / 2 minor

Summary. The paper introduces a synthetic benchmark and evaluation framework to characterize counting performance in open-source vision-language models under controlled perturbations of visual properties (object count, color, texture, background) and linguistic factors (prompt specificity). It examines corresponding shifts in visual attention allocation and conducts exploratory attention reweighting interventions in the language decoder layers to modulate focus on visual tokens. Key results indicate systematic accuracy degradation with rising visual and linguistic complexity, mirroring human cognitive load effects, alongside modest measurable gains from the reweighting approach. The work positions itself as a diagnostic tool for cross-modal binding failures rather than a new accuracy benchmark, with code and data released.

Significance. If the central empirical trends hold under stronger controls, the synthetic diagnostic framework would provide a useful complement to natural-image benchmarks by isolating factors that affect enumeration and cross-modal grounding in VLMs. The attention analysis and preliminary reweighting results offer mechanistic insights that could guide future inference-time interventions or architectural changes for quantity reasoning. Reproducibility is supported by the public code and data release.

major comments (1)

[Attention reweighting experiments] Attention reweighting experiments (abstract and corresponding results section): the manuscript reports modest counting gains from targeted reweighting at different decoder layers but does not present control ablations such as uniform scaling of all visual tokens, reweighting of non-visual tokens, or random position/magnitude shifts matched to the intervention strength. Without these, the results do not yet isolate effects specific to quantity-concept grounding from generic changes in token salience or decoder dynamics, weakening support for the claim that the operation influences cross-modal binding for linguistic quantity concepts.

minor comments (2)

[Abstract] Abstract: the summary of results mentions 'modest but measurable improvements' and 'systematic degradation' without numerical magnitudes, error bars, or reference to statistical tests; adding these details would improve clarity and allow readers to assess effect sizes directly.
[Benchmark construction] The weakest assumption noted in the stress-test (synthetic perturbations isolating intended binding failures without new artifacts) is not explicitly tested or discussed in the manuscript; a brief validation against a small set of natural images or artifact checks would strengthen the framework's claims.
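One lightweight way to supply the error bars requested in the first minor comment is a percentile bootstrap over per-image correctness flags. The sketch below is only an illustration under that assumption: the `bootstrap_accuracy_ci` helper and the example flags are hypothetical and are not taken from the paper or its code release.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample images with replacement and record the resampled accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int(round((alpha / 2) * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return sum(correct) / n, means[lo_idx], means[hi_idx]


# Hypothetical outcomes (1 = exact count match); not results from the paper.
flags = [1] * 60 + [0] * 40
acc, lo, hi = bootstrap_accuracy_ci(flags)
print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting the interval per perturbation level would let readers judge whether the degradation trend and the reweighting gains exceed sampling noise.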

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive review. We appreciate the positive assessment of the synthetic benchmark's diagnostic value and the attention analysis. We address the single major comment below and will incorporate revisions to strengthen the manuscript.

Referee: Attention reweighting experiments (abstract and corresponding results section): the manuscript reports modest counting gains from targeted reweighting at different decoder layers but does not present control ablations such as uniform scaling of all visual tokens, reweighting of non-visual tokens, or random position/magnitude shifts matched to the intervention strength. Without these, the results do not yet isolate effects specific to quantity-concept grounding from generic changes in token salience or decoder dynamics, weakening support for the claim that the operation influences cross-modal binding for linguistic quantity concepts.

Authors: We agree that the current set of experiments would benefit from additional controls to better isolate whether the observed gains stem specifically from enhanced cross-modal binding for quantity concepts rather than generic alterations in token salience or decoder behavior. Our reweighting experiments were presented as exploratory, with the goal of providing preliminary evidence that targeted interventions in the language decoder can measurably affect counting performance. To address the concern, we will add control ablations in the revised manuscript, including (1) uniform scaling applied to all visual tokens and (2) random position/magnitude perturbations matched in strength to the targeted interventions. These will be reported alongside the existing results to clarify the specificity of the effects. We believe this will strengthen support for the mechanistic interpretation without altering the core claims of the work.

Revision: yes
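The control conditions discussed in this exchange can be sketched on a single post-softmax attention row. Everything here is an assumption for illustration: the `reweight` and `control_conditions` helpers are hypothetical, multiplicative scaling followed by renormalization is only one possible reweighting, and "matched in strength" is approximated by reusing the same scale on the same number of randomly chosen visual positions.

```python
import random


def reweight(attn_row, indices, scale):
    """Multiply attention at `indices` by `scale`, then renormalize the row to sum to 1."""
    idx = set(indices)
    scaled = [w * scale if i in idx else w for i, w in enumerate(attn_row)]
    total = sum(scaled)
    return [w / total for w in scaled]


def control_conditions(attn_row, visual_idx, target_idx, scale, seed=0):
    """Return the targeted intervention next to two controls (hypothetical helper)."""
    rng = random.Random(seed)
    # Random control: same number of positions and same scale, but random visual tokens.
    random_idx = rng.sample(list(visual_idx), len(target_idx))
    return {
        "targeted": reweight(attn_row, target_idx, scale),
        "uniform_visual": reweight(attn_row, visual_idx, scale),
        "random_matched": reweight(attn_row, random_idx, scale),
    }


# Toy row over 5 tokens; positions 1-3 are treated as visual tokens.
row = [0.1, 0.2, 0.3, 0.25, 0.15]
out = control_conditions(row, visual_idx=[1, 2, 3], target_idx=[2], scale=2.0)
for name, weights in out.items():
    print(name, [round(w, 3) for w in weights])
```

If counting accuracy moved by a similar amount under all three conditions, the gain would look like a generic salience effect rather than evidence of quantity-specific grounding.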

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on synthetic data

The paper conducts controlled experiments on a synthetic benchmark, directly measuring counting accuracy against ground-truth object counts under perturbations of visual and linguistic factors. Attention reweighting is applied as an exploratory intervention with reported effects on performance. No derivations, equations, fitted parameters renamed as predictions, or self-referential steps appear in the described framework. Results are evaluated externally against the synthetic ground truth, rendering the analysis self-contained without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that the chosen synthetic perturbations faithfully expose cross-modal binding issues and that attention reweighting operates as an independent causal lever; no free parameters, axioms, or invented entities are introduced beyond standard VLM architecture assumptions.

pith-pipeline@v0.9.0 · 5574 in / 1093 out tokens · 30724 ms · 2026-05-17T20:01:08.878400+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

We further conduct exploratory attention reweighting experiments in the language model decoder to modulate focus on visual tokens at different layers and assess their effects on counting behavior.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
cs.CV 2026-05 unverdicted novelty 7.0

CounterCount shows VLMs perform well on factual counting images but degrade on counterfactual edits, revealing reliance on object priors, and introduces an attention reweighting method that improves accuracy by up to 8%.

PushupBench: Your VLM is not good at counting pushups
cs.CV 2026-04 unverdicted novelty 7.0

VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1] Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, and Giuseppe Riccardi. [De|Re]constructing VLMs' reasoning in counting. arXiv preprint arXiv:2510.19555, 2025.

[2] Niki Amini-Naieni, Tengda Han, and Andrew Zisserman. CountGD: Multi-modal open-world counting. Advances in Neural Information Processing Systems, 37:48810–48837.

[3] Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29915–29926, 2025.

[4] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 397–406, 2021.

[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

[6] Lianghan Dong and Anamaria Crisan. Probing the visualization literacy of vision language models: the good, the bad, and the ugly. arXiv preprint arXiv:2504.05445, 2025.

[7] Google. Google DeepMind: Gemini 2.5 Pro, 2025. https://deepmind.google/models/gemini/pro/

[8] Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, and Jiahao Zhang. Your vision-language model can’t even count to 20: Exposing the failures of VLMs in compositional counting. arXiv preprint arXiv:2510.04401, 2025.

[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[10] Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual language? arXiv preprint arXiv:2410.00193, 2024.

[11] Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, and Hongming Shan. Point segment and count: A generalized framework for object counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17067–17076, 2024.

[12] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321, 2025.

[13] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[14] Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, and Kyomin Jung. VLind-Bench: Measuring language priors in large vision-language models. arXiv preprint arXiv:2406.08702, 2024.

[15] Tony Lee, Haoqin Tu, Chi H Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin S Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. VHELM: A holistic evaluation of vision language models. Advances in Neural Information Processing Systems, 37:140632–140666, 2024.

[16] OpenAI. OpenAI: Introducing OpenAI o3 and o4-mini, 2025. https://openai.com/index/introducing-o3-and-o4-mini/

[17] Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. CrowdDiff: Multi-hypothesis crowd density estimation using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12809–12819, 2024.

[18] Ji Seung Ryu, Hyunyoung Kang, Yuseong Chu, and Sejung Yang. Vision-language foundation models for medical imaging: a review of current practices and innovations. Biomedical Engineering Letters, pages 1–22, 2025.

[19] Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large VLM-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073, 2025.

[20] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, H...

[21] Qwen Team. Qwen2.5-VL, 2025.

[22] An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased, 2025.
[23] Layer-wise Propagation of Visual attention. Gradient-weighted attention. Inspired by Chefer et al

[24] we propose a lightweight gradient-weighted relevance propagation (LPV) for autoregressive VLMs that turns layer-wise attentions into token-level relevance maps by using gradient weighting and cross-layer diffusion. For each Transformer layer $\ell$, let $A^{(\ell)} \in \mathbb{R}^{H \times S \times S}$ be the multi-head attention (post-softmax) and let $G^{(\ell)} = \partial L / \partial A^{(\ell)}$ be its gradient obt...
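The excerpt above is truncated before LPV's update rule, so the sketch below does not reproduce LPV. It illustrates the rule from Chefer et al., which the passage names as its inspiration: average over heads the positive part of gradient times attention, then accumulate relevance layer by layer starting from the identity. The function names and the toy tensors are hypothetical, and plain nested lists stand in for tensors shaped [H][S][S].

```python
def head_averaged_relevance(attn, grad):
    """Mean over heads of max(grad * attn, 0); attn and grad are nested lists [H][S][S]."""
    num_heads, seq_len = len(attn), len(attn[0])
    return [
        [
            sum(max(grad[h][i][j] * attn[h][i][j], 0.0) for h in range(num_heads)) / num_heads
            for j in range(seq_len)
        ]
        for i in range(seq_len)
    ]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def propagate_relevance(attn_layers, grad_layers):
    """Chefer-style accumulation: R starts as identity, then R <- R + A_bar @ R per layer."""
    seq_len = len(attn_layers[0][0])
    relevance = [[1.0 if i == j else 0.0 for j in range(seq_len)] for i in range(seq_len)]
    for attn, grad in zip(attn_layers, grad_layers):
        a_bar = head_averaged_relevance(attn, grad)
        update = matmul(a_bar, relevance)
        relevance = [
            [relevance[i][j] + update[i][j] for j in range(seq_len)] for i in range(seq_len)
        ]
    return relevance


# Toy input: one layer, one head, two tokens (values chosen by hand for illustration).
attn_layers = [[[[1.0, 0.0], [0.5, 0.5]]]]
grad_layers = [[[[1.0, 0.0], [-1.0, 2.0]]]]
relevance = propagate_relevance(attn_layers, grad_layers)
print(relevance[-1])  # relevance of each token for the final position
```

In an autoregressive VLM the last row is the one of interest, since it attributes the generated token to the visual and text tokens before it.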
[25] Count the number of objects in this image. Answer the count within curly brackets, eg.{10}

Attention Reweighting in Qwen Models. 8.1. Attention Reweighting in Grouped Query Attention Architecture. Qwen 2.5 and Qwen 3 models employ Grouped Query Attention (GQA), which differs from standard Multi-Head Attention by using fewer key-value heads than query heads to reduce computational cost. Specifically, with H = 32 attention heads and K = 8 key-value...
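With H = 32 query heads and K = 8 key-value heads, each key-value head is shared by 32 / 8 = 4 consecutive query heads. The sketch below shows that index mapping and a per-query-head reweighting of visual positions. The passage is truncated before the paper's actual operation, so the helper names and the scale-then-renormalize step are assumptions, not the authors' implementation.

```python
def kv_head_for_query_head(q_head, num_q_heads=32, num_kv_heads=8):
    """Index of the key-value head shared by query head `q_head` under GQA."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise IndexError("q_head out of range")
    group_size = num_q_heads // num_kv_heads  # 4 query heads per key-value head here
    return q_head // group_size


def scale_visual_attention(attn, visual_idx, scale, heads=None):
    """attn holds one post-softmax row per query head for a single query position.

    Visual positions are multiplied by `scale` for the chosen query heads and each
    modified row is renormalized; other heads are copied unchanged.
    """
    visual = set(visual_idx)
    chosen = set(range(len(attn))) if heads is None else set(heads)
    out = []
    for h, row in enumerate(attn):
        if h not in chosen:
            out.append(list(row))
            continue
        scaled = [w * scale if j in visual else w for j, w in enumerate(row)]
        total = sum(scaled)
        out.append([w / total for w in scaled])
    return out


# Toy example: 32 query heads, 4 key positions, positions 1 and 2 are visual tokens.
attn = [[0.4, 0.2, 0.2, 0.2] for _ in range(32)]
group = [h for h in range(32) if kv_head_for_query_head(h) == 0]  # heads sharing KV head 0
boosted = scale_visual_attention(attn, visual_idx=[1, 2], scale=1.5, heads=group)
print(group)
print(boosted[0], boosted[4])
```

Because attention weights exist per query head even when keys and values are shared, an intervention can target a whole key-value group or individual query heads within it.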
[26] Sample Images. (a) black (b) white (c) red (d) yellow (e) blue (f) light gray (g) green (h) multicolor. Figure 4. Example images for the Object category, Color pattern, showing different object colors. (a) Checkerboard (b) Concentric Circles (c) Crosshatch (d) Diagonal Stripes (e) Dots (f) Horizontal Stripes (g) Linear Gradient (h) Radial Gradient (i) Vertica...
[27] circles” (as default in color experiment), “squares

Prompts. Table 9. Prompts used when image has different Object Color or Shape.

| ID | Example Prompt Text | Logical Role / Cognitive Cue |
| --- | --- | --- |
| P1 | Count the number of distinct objects in this image... | Baseline: Generic unconstrained prompt. |
| P2 | Count the number of {color} color objects in this image... | Single (Simple) Attribute: Simple target Cue (Color) - Replace {color} with... |
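The counting prompt quoted in entry [25] asks the model to put its answer inside curly brackets. A minimal parser for that convention might look like the sketch below; the `parse_count` helper and the choice to take the last bracketed integer when several appear are assumptions, not taken from the paper's code.

```python
import re

# Matches a non-negative integer wrapped in curly brackets, e.g. "{10}" or "{ 10 }".
COUNT_PATTERN = re.compile(r"\{\s*(\d+)\s*\}")


def parse_count(response):
    """Return the last integer wrapped in curly brackets, or None if none is found."""
    matches = COUNT_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


print(parse_count("There are {7} circles in the image."))
print(parse_count("I can see seven circles."))
```

Returning `None` for unparseable responses keeps format failures separate from counting errors when scoring.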
[28] Effects of Visual Complexity. Tables 12, 13, 14, and 15 present the Mean Relative Count Error (MRCE) for prompts 1, 3, 4, and 5, respectively
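The excerpt names the Mean Relative Count Error but does not define it. One plausible reading, assumed here, is the absolute count error divided by the true count, averaged over images; the function name and the example values below are hypothetical.

```python
def mean_relative_count_error(predicted, truth):
    """Assumed definition: mean over images of |predicted - truth| / truth."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("predicted and truth must be non-empty and of equal length")
    if any(t <= 0 for t in truth):
        raise ValueError("ground-truth counts must be positive")
    return sum(abs(p - t) / t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical predictions and ground-truth counts, not results from the paper.
mrce = mean_relative_count_error([5, 8, 10], [5, 10, 20])
print(round(mrce, 4))
```

Normalizing by the true count keeps errors on crowded images from dominating the average, which matters when object number is itself a manipulated factor.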
[29] blue-green

Attention on Visual Tokens. Figures 9-10 show the distribution of attention over vision as well as counting error across prompts for Qwen2.5-32B-Instruct and InternVL3-9B. Across models (i.e. Qwen2.5-32B-Instruct, InternVL3-9B, Qwen2.5-7B, and Kimi-VL-A3B), we observe a consistent divide in how architectural scale influences the effect of prompt spe...

[1] [1]

[de— re] constructing vlms’ reasoning in counting.arXiv preprint arXiv:2510.19555, 2025

Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, and Giuseppe Riccardi. [de— re] constructing vlms’ reasoning in counting.arXiv preprint arXiv:2510.19555, 2025. 1, 2

work page arXiv 2025

[2] [2]

Countgd: Multi-modal open-world counting.Advances in Neural Information Processing Systems, 37:48810–48837,

Niki Amini-Naieni, Tengda Han, and Andrew Zisserman. Countgd: Multi-modal open-world counting.Advances in Neural Information Processing Systems, 37:48810–48837,

work page

[3] [3]

Mitigating object hallucinations in large vision- language models with assembly of global and local attention

Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shi- jian Lu. Mitigating object hallucinations in large vision- language models with assembly of global and local attention. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 29915–29926, 2025. 2

work page 2025

[4] [4]

Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers

Hila Chefer, Shir Gur, and Lior Wolf. Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 397–406, 2021. 1

work page 2021

[5] [5]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 1

work page 2024

[6] [6]

Probing the visualiza- tion literacy of vision language models: the good, the bad, and the ugly.arXiv preprint arXiv:2504.05445, 2025

Lianghan Dong and Anamaria Crisan. Probing the visualiza- tion literacy of vision language models: the good, the bad, and the ugly.arXiv preprint arXiv:2504.05445, 2025. 2

work page arXiv 2025

[7] [7]

Google deepmind: Gemini 2.5 pro, 2025.https: //deepmind.google/models/gemini/pro/

Google. Google deepmind: Gemini 2.5 pro, 2025.https: //deepmind.google/models/gemini/pro/. 2

work page 2025

[8] [8]

Your vision-language model can’t even count to 20: Exposing the failures of vlms in compositional counting

Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, and Ji- ahao Zhang. Your vision-language model can’t even count to 20: Exposing the failures of vlms in compositional counting. arXiv preprint arXiv:2510.04401, 2025. 1, 2

work page arXiv 2025

[9] [9]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InProceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 4

work page 2017

[10] [10]

Do vision- language models really understand visual language? arXiv preprint arXiv:2410.00193,

Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual lan- guage?arXiv preprint arXiv:2410.00193, 2024. 2

work page arXiv 2024

[11] [11]

Point segment and count: A gener- alized framework for object counting

Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, and Hongming Shan. Point segment and count: A gener- alized framework for object counting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17067–17076, 2024. 1

work page 2024

[12] [12]

See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025. 2

work page arXiv 2025

[13] [13]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 4

work page 2023

[14] Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, and Kyomin Jung. VLind-Bench: Measuring language priors in large vision-language models. arXiv preprint arXiv:2406.08702, 2024. 2

[15] Tony Lee, Haoqin Tu, Chi H Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin S Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. VHELM: A holistic evaluation of vision language models. Advances in Neural Information Processing Systems, 37:140632–140666, 2024. 2

[16] OpenAI. OpenAI: Introducing OpenAI o3 and o4-mini, 2025. https://openai.com/index/introducing-o3-and-o4-mini/. 2

[17] Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. CrowdDiff: Multi-hypothesis crowd density estimation using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12809–12819, 2024. 1

[18] Ji Seung Ryu, Hyunyoung Kang, Yuseong Chu, and Sejung Yang. Vision-language foundation models for medical imaging: a review of current practices and innovations. Biomedical Engineering Letters, pages 1–22, 2025. 1

[19] Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large VLM-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073, 2025. 1

[20] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, H...

[21] Qwen Team. Qwen2.5-VL, 2025. 1

[22] An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased, 2025. 2

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

Supplementary Material

Layer-wise Propagation of Visual Attention

Gradient-weighted attention. Inspired by Chefer et al., we propose a lightweight gradient-weighted relevance propagation (LPV) for autoregressive VLMs that turns layer-wise attentions into token-level relevance maps by using gradient weighting and cross-layer diffusion. For each Transformer layer $\ell$, let $A^{(\ell)} \in \mathbb{R}^{H \times S \times S}$ be the multi-head attention (post-softmax) and let $G^{(\ell)} = \frac{\partial L}{\partial A^{(\ell)}}$ be its gradient obt...
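The propagation described above can be illustrated with a minimal NumPy sketch. Because the LPV definition is truncated in this text, the update rule used here is an assumption: it is the gradient-weighted rollout of Chefer et al., in which each layer's attention is multiplied element-wise by its gradient, clipped at zero, averaged over heads, and accumulated across layers. The function name `gradient_weighted_rollout` and the random inputs are hypothetical and stand in for attentions and gradients taken from a real model.

```python
import numpy as np

def gradient_weighted_rollout(attentions, gradients):
    """Accumulate token relevance across layers.

    attentions, gradients: lists with one (H, S, S) array per layer,
    holding post-softmax attention A and its gradient G = dL/dA.
    Assumed rule (Chefer et al. style): R <- R + mean_h(relu(G * A)) @ R.
    """
    S = attentions[0].shape[-1]
    relevance = np.eye(S)  # every token starts out relevant only to itself
    for A, G in zip(attentions, gradients):
        # Gradient-weight each head, keep positive contributions, average heads.
        A_bar = np.maximum(G * A, 0.0).mean(axis=0)  # (S, S)
        # Diffuse relevance through this layer; the running sum keeps the residual path.
        relevance = relevance + A_bar @ relevance
    return relevance

# Toy inputs: random causal attention and random gradients (illustrative only).
rng = np.random.default_rng(0)
H, S, L = 4, 6, 3
causal = np.tril(np.ones((S, S), dtype=bool))
attn = []
for _ in range(L):
    logits = np.where(causal, rng.normal(size=(H, S, S)), -np.inf)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn.append(weights / weights.sum(axis=-1, keepdims=True))
grads = [rng.normal(size=(H, S, S)) for _ in range(L)]

R = gradient_weighted_rollout(attn, grads)
token_relevance = R[-1]  # relevance of each input token to the last position
```

In an autoregressive decoder the last row of `R` is the natural readout, since it scores every earlier token (visual or textual) against the position that produces the next token.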

Count the number of objects in this image. Answer the count within curly brackets, eg.{10}

Attention Reweighting in Qwen Models

8.1. Attention Reweighting in Grouped Query Attention Architecture

Qwen 2.5 and Qwen 3 models employ Grouped Query Attention (GQA), which differs from standard Multi-Head Attention by using fewer key-value heads than query heads to reduce computational cost. Specifically, with H = 32 attention heads and K = 8 key-value...
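The head sharing in GQA can be sketched in NumPy as follows. With H = 32 query heads and K = 8 key-value heads, each key-value head serves H / K = 4 query heads. The reweighting step is an assumption, since the paper's exact procedure is truncated here: the sketch scales post-softmax weights on visual-token keys by a hypothetical factor `alpha` and renormalizes each row. The function name, `alpha`, `visual_mask`, and the toy shapes are all illustrative.

```python
import numpy as np

H, K = 32, 8      # query heads and key-value heads, as stated in the text
GROUP = H // K    # query heads that share one key-value head

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention_with_visual_boost(q, k, v, visual_mask, alpha=1.0):
    """q: (H, S, D); k, v: (K, S, D); visual_mask: (S,) bool; alpha > 0."""
    # Repeat key-value heads so that query head h reads KV head h // GROUP.
    k_full = np.repeat(k, GROUP, axis=0)  # (H, S, D)
    v_full = np.repeat(v, GROUP, axis=0)  # (H, S, D)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (H, S, S)
    causal = np.tril(np.ones(scores.shape[-2:], dtype=bool))
    weights = softmax(np.where(causal, scores, -np.inf))
    # Assumed intervention: scale attention on visual keys, then renormalize rows.
    weights = weights * np.where(visual_mask, alpha, 1.0)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_full, weights

# Toy example: 5 tokens, the first 3 of which are visual tokens.
rng = np.random.default_rng(0)
S, D = 5, 16
q = rng.normal(size=(H, S, D))
k = rng.normal(size=(K, S, D))
v = rng.normal(size=(K, S, D))
visual_mask = np.array([True, True, True, False, False])

out, w = gqa_attention_with_visual_boost(q, k, v, visual_mask, alpha=1.5)
_, w_base = gqa_attention_with_visual_boost(q, k, v, visual_mask, alpha=1.0)
```

Because keys and values are shared within a group, any intervention applied on the key-value side affects all four query heads of that group at once, whereas scaling the post-softmax weights, as done here, can in principle be applied per query head.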

Sample Images

Figure 4. Example images for the Object category, Color pattern, showing different object colors. Panels: (a) black, (b) white, (c) red, (d) yellow, (e) blue, (f) light gray, (g) green, (h) multicolor.

(a) Checkerboard (b) Concentric Circles (c) Crosshatch (d) Diagonal Stripes (e) Dots (f) Horizontal Stripes (g) Linear Gradient (h) Radial Gradient (i) Vertica...

circles”(as default in color experiment), “squares

Prompts

Table 9. Prompts used when image has different Object Color or Shape.

ID | Example Prompt Text | Logical Role / Cognitive Cue
P1 | Count the number of distinct objects in this image... | Baseline: Generic unconstrained prompt.
P2 | Count the number of {color} color objects in this image... | Single (Simple) Attribute: Simple target Cue (Color) - Replace {color} with...

Effects of Visual Complexity

Tables 12, 13, 14, and 15 present the Mean Relative Count Error (MRCE) for prompts 1, 3, 4, and 5, respectively.
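As a rough illustration of how such a metric could be computed, the sketch below parses a count from a model response and averages the relative error. Two things are assumptions rather than details confirmed by this text: that responses carry the count in curly brackets, as in the prompt format `{10}`, and that MRCE is the mean of |predicted - true| / true over images. The helper names `parse_count` and `mean_relative_count_error` are hypothetical.

```python
import re

def parse_count(response):
    """Return the integer inside the first {...} in a response, or None if absent."""
    match = re.search(r"\{\s*(\d+)\s*\}", response)
    return int(match.group(1)) if match else None

def mean_relative_count_error(predictions, targets):
    """Assumed definition: mean over images of |predicted - true| / true."""
    errors = [abs(p - t) / t for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)

# Toy responses and ground-truth counts (illustrative only).
responses = ["I count {8}.", "{10}", "Answer: { 15 }"]
targets = [10, 10, 12]
predictions = [parse_count(r) for r in responses]
mrce = mean_relative_count_error(predictions, targets)
```

A real evaluation would also need a policy for responses that contain no parsable count, which this sketch leaves out.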

blue-green

Attention on Visual Tokens

Figures 9-10 show the distribution of attention over vision as well as counting error across prompts for Qwen2.5-32B-Instruct and InternVL3-9B. Across models (i.e., Qwen2.5-32B-Instruct, InternVL3-9B, Qwen2.5-7B, and Kimi-VL-A3B), we observe a consistent divide in how architectural scale influences the effect of prompt spe...