pith. machine review for the scientific record.

arxiv: 2605.02752 · v2 · submitted 2026-05-04 · 💻 cs.CV

Recognition: no theorem link

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

Fabrizio Falchi, Giacomo Pacini, Giuseppe Amato, Luca Ciampi, Nicola Messina, Nicola Tonellotto

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords class-agnostic counting · text-guided counting · semantic grounding · evaluation protocols · MUCCA dataset · prompt understanding · visual grounding · open-world vision

The pith

Current text-guided class-agnostic counting models fail to ground prompt meanings in visual scenes despite strong standard counting scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that state-of-the-art models for counting arbitrary objects from natural language prompts often count the wrong objects or ignore the prompt entirely when scenes contain multiple categories. It introduces PrACo++ with negative-label and distractor protocols plus the MUCCA dataset of real multi-category images to expose these grounding failures. Standard metrics miss the issue because they test single-category images where models can succeed by pattern matching rather than semantic understanding. The work shows that prompt similarity increases error rates and calls for architectures that align text descriptions with specific visual objects. This matters for any real-world use where the counted class must match the user's intent exactly.

Core claim

Despite low counting errors on existing benchmarks, current text-guided class-agnostic counting models frequently produce spurious counts when prompts are altered with negative labels or distractor classes, revealing that they do not correctly map textual object descriptions to the corresponding visual instances in multi-category scenes.

What carries the argument

The negative-label test and distractor test within the PrACo++ framework, which quantify drops in counting accuracy when prompts specify absent or competing classes, evaluated on the MUCCA dataset of multi-annotated real-world scenes.
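
A minimal sketch of how the negative-label protocol can be scored, assuming a hypothetical `count_fn(image, prompt)` interface and per-image ground-truth counts by category. The prompt template and the normalization by the image's true object total are illustrative assumptions; the page describes the metric only at a high level.

```python
from typing import Callable, Dict, List

def negative_label_errors(
    count_fn: Callable[[object, str], float],  # hypothetical: (image, prompt) -> predicted count
    images: List[object],
    annotations: List[Dict[str, int]],         # per-image ground-truth counts by category
    all_categories: List[str],
) -> List[float]:
    """Query each image with prompts for categories ABSENT from it.
    A semantically grounded counter should return ~0 for every such prompt;
    any positive count is a spurious, ungrounded response."""
    errors = []
    for image, gt in zip(images, annotations):
        present = {c for c, n in gt.items() if n > 0}
        total = max(sum(gt.values()), 1)  # guard against empty images
        for cat in all_categories:
            if cat in present:
                continue  # the negative-label test uses only absent categories
            pred = count_fn(image, f"the {cat}")  # assumed prompt template
            errors.append(pred / total)           # assumed normalization
    return errors
```

A mean error near zero would pass the test; large values flag counts hallucinated for classes that are not in the scene.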

If this is right

  • Current models require new architectures that explicitly align textual semantics with visual object features instead of relying on overall scene statistics.
  • Evaluation protocols for text-guided counting must incorporate negative and distractor cases to measure trustworthiness beyond raw count error.
  • Semantic similarity between alternative prompts directly increases the rate of grounding failures across tested methods.
  • Real-world deployments in open scenes will produce unreliable results unless models pass the new robustness tests.
  • The MUCCA dataset enables systematic comparison of future methods on multi-category grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could add explicit grounding modules that first localize candidate objects before counting to reduce prompt misalignment.
  • The observed failures likely extend to other vision-language tasks where descriptions must be matched to specific instances rather than global image properties.
  • The protocols could be adapted to video sequences to test whether motion cues improve semantic grounding over static images.
  • Quantitative analysis of prompt similarity suggests targeted fine-tuning on hard negative examples would improve reliability.

Load-bearing premise

The negative-label and distractor protocols together with the MUCCA dataset isolate failures of semantic grounding without being confounded by localization errors or prompt phrasing sensitivity.

What would settle it

Models that maintain high counting accuracy and correct class selection on the negative-label and distractor tests within the MUCCA dataset, rather than defaulting to visually dominant objects unrelated to the prompt.

Figures

Figures reproduced from arXiv: 2605.02752 by Fabrizio Falchi, Giacomo Pacini, Giuseppe Amato, Luca Ciampi, Nicola Messina, Nicola Tonellotto.

Figure 1
Figure 1: High-level overview of our PrACo++ test suite. We empirically show that SOTA open-world text-guided CAC methods are not properly evaluated by current benchmarks, as they fail to assess the alignment between textual semantic understanding and counting accuracy. To address this limitation, we introduce a new test suite, Prompt-Aware Counting++ (PrACo++), composed of two complementary tests. (i) On the left, … view at source ↗
Figure 2
Figure 2: Overview of the inference procedure used in the test suites. For each image, the model predicts object counts for all categories in the dataset. Orange boxes report the ground-truth counts for categories present in the image, while gray boxes denote categories not present (ground-truth count equal to zero). In the negative-label test, we consider only categories not present in the image (gray boxes) and ev… view at source ↗
Figure 3
Figure 3: Example illustrating the computation of the proposed counting metrics on the distractor test. The image is partitioned into spatial patches, and the output density map is integrated within each patch to obtain patch-level predicted counts. For each patch, we compute true positives (TP), false positives (FP), false negatives (FN), and the mean absolute error (MAE), and subsequently aggregate their contribut… view at source ↗
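
A rough sketch of the patch-level bookkeeping the Figure 3 caption describes: partition the maps into spatial patches, integrate the predicted density within each patch, and tally TP, FP, FN, and MAE. The patch size, presence threshold, and edge handling below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def patch_level_metrics(pred_density: np.ndarray,
                        gt_density: np.ndarray,
                        patch: int = 64,       # assumed patch size
                        thresh: float = 0.5):  # assumed presence threshold
    """Integrate (sum) both density maps within each patch-sized cell,
    then compare patch-level counts to get TP/FP/FN and accumulate MAE."""
    assert pred_density.shape == gt_density.shape
    H, W = pred_density.shape
    tp = fp = fn = 0
    abs_errs = []
    for y in range(0, H - patch + 1, patch):      # remainder at the edges is dropped
        for x in range(0, W - patch + 1, patch):
            p = float(pred_density[y:y + patch, x:x + patch].sum())
            g = float(gt_density[y:y + patch, x:x + patch].sum())
            p_has, g_has = p > thresh, g > thresh
            tp += p_has and g_has
            fp += p_has and not g_has
            fn += g_has and not p_has
            abs_errs.append(abs(p - g))
    mae = float(np.mean(abs_errs)) if abs_errs else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "MAE": mae}
```
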
Figure 4
Figure 4: Samples from the MUCCA dataset. We show a selection of dot-annotated images from our MUlti-Category Class-Agnostic counting (MUCCA) dataset, a collection specifically designed for CAC and characterized by the presence of multiple object categories annotated within each image. … view at source ↗
Figure 5
Figure 5: Statistical overview of the MUCCA dataset. Top: frequency of object classes measured by the number of images in which each class appears. Bottom-left: distribution of images according to the number of distinct object categories they contain, highlighting the intrinsic multi-class nature of the dataset. Bottom-right: distribution of the total number of object instances per image, illustrating the wide varia… view at source ↗
Figure 6
Figure 6: Effect of textual prompt semantic similarity in the negative-label test. (a) Pearson correlation between semantic similarity (CLIP cosine similarity) and normalized counting error for each evaluated model. (b) Quartile distributions of normalized counting errors grouped into five equal-width semantic similarity bins. For each negative category in an image, the counting error is given by the model predict… view at source ↗
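
The Figure 6 analysis correlates prompt similarity with spurious counts. A small sketch under stated assumptions: text embeddings (e.g., CLIP's) are precomputed, each negative prompt is compared against the classes actually present in the image (each case assumed to have at least one), and the maximum cosine similarity is the value correlated with the normalized error. The max-reduction is an assumption, not the paper's stated procedure.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_vs_error(neg_embs, present_embs_per_case, norm_errors):
    """For each negative-prompt case, take the cosine similarity to the
    closest class present in the image, then Pearson-correlate those
    similarities with the normalized counting errors."""
    sims = [max(cosine(n, p) for p in present)  # closest present class
            for n, present in zip(neg_embs, present_embs_per_case)]
    r = float(np.corrcoef(sims, norm_errors)[0, 1])  # Pearson r
    return np.asarray(sims), r
```
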
Figure 7
Figure 7: Qualitative results for the negative-label test. We present sample images from both the FSC-147 and MUCCA datasets, alongside the density maps generated by the evaluated text-guided CAC models when queried with absent object categories (negative prompts). … view at source ↗
Figure 8
Figure 8: Qualitative results for the distractor test. We showcase the behavior of various models when tasked with counting a target class in the presence of confusing, co-occurring object categories. The examples include synthetic mosaics constructed from single-class FSC-147 images (top rows) and real-world multi-class scenes from the MUCCA dataset (bottom rows). … view at source ↗
read the original abstract

Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations. This limitation leads to spurious counting responses and reduced reliability in real-world scenarios. To systematically address these limitations, we propose a new evaluation framework focused on model robustness and trustworthiness. Our contribution is two-fold: (i) we introduce PrACo++ (Prompt-Aware Counting++), a novel test suite featuring two dedicated evaluation protocols -- the negative-label test and the distractor test -- paired with new specialized metrics; and (ii) we present the MUCCA (MUlti-Category Class-Agnostic counting) evaluation dataset, a new collection of real-world images featuring multiple annotated object categories per scene, unlike existing CAC benchmarks that typically include a single category per image. Our extensive experimental evaluation of 10 state-of-the-art methods shows that, despite strong performance under standard counting metrics, current models exhibit significant weaknesses in understanding and grounding object class descriptions. Finally, we provide a quantitative analysis of how semantic similarity between prompts influences these failures. Overall, our results underscore the need for more semantically grounded architectures and offer a reliable framework for future assessment in open-world text-guided CAC methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that state-of-the-art text-guided class-agnostic counting (CAC) models perform well on standard counting metrics for single-category images but exhibit significant failures in semantically grounding textual prompts to visual objects. To expose this, the authors introduce PrACo++ (with negative-label and distractor protocols plus new metrics) and the MUCCA dataset of multi-category real-world images. Experiments on 10 methods show prompt-induced errors and a quantitative link to semantic similarity between prompts; the work argues for more semantically grounded architectures and supplies a new evaluation framework.

Significance. If the central claim holds after addressing controls, the work is significant: it identifies a previously under-tested failure mode in open-world CAC, supplies the first dedicated protocols and multi-category benchmark for trustworthiness, and demonstrates that standard metrics are insufficient. The empirical analysis of semantic similarity provides a concrete, falsifiable direction for future model design.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (MUCCA construction): the multi-category scenes increase localization and instance-separation demands relative to single-category benchmarks. Without reporting results on single-category subsets of MUCCA or localization-oracle baselines, it is unclear whether the observed drops on PrACo++ protocols are attributable to semantic grounding failures or to these confounding factors.
  2. [§3.2] §3.2 (negative-label and distractor protocols): the new prompt structures differ from the original training/inference prompts used in the 10 evaluated methods. Without an ablation that holds prompt phrasing fixed while varying only label semantics, performance degradation could reflect prompt sensitivity rather than grounding misalignment.
  3. [§5] §5 (experimental results): the abstract states strong standard-metric performance, yet the main text does not clarify whether those metrics were computed on the same multi-category MUCCA images used for the PrACo++ tests. This missing cross-protocol comparison weakens the claim that standard metrics are insufficient.
minor comments (2)
  1. [§3.3] Notation for the new metrics (e.g., definitions of negative-label accuracy and distractor error) should be collected in a single table or subsection for easier reference.
  2. [§4] The paper should report the number of images and category pairs in MUCCA and the exact train/test split used for the 10 methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of experimental controls and clarity that will strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions in the next version.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (MUCCA construction): the multi-category scenes increase localization and instance-separation demands relative to single-category benchmarks. Without reporting results on single-category subsets of MUCCA or localization-oracle baselines, it is unclear whether the observed drops on PrACo++ protocols are attributable to semantic grounding failures or to these confounding factors.

    Authors: We agree that multi-category scenes introduce additional localization and separation challenges. To isolate semantic grounding effects, the revised manuscript will include an analysis on single-category subsets of MUCCA (selected images containing only one annotated category) as well as localization-oracle baselines that supply ground-truth boxes to the models. These additions will allow direct comparison and better attribute performance drops to prompt grounding rather than localization demands. revision: yes

  2. Referee: [§3.2] §3.2 (negative-label and distractor protocols): the new prompt structures differ from the original training/inference prompts used in the 10 evaluated methods. Without an ablation that holds prompt phrasing fixed while varying only label semantics, performance degradation could reflect prompt sensitivity rather than grounding misalignment.

    Authors: This point is well taken. While the protocols were designed to probe semantic robustness, we will add a controlled ablation in the revision that fixes prompt phrasing to match the original methods' templates and varies only the label semantics (positive vs. negative or distractor). This will separate prompt sensitivity from true grounding misalignment and strengthen the interpretation of the results. revision: yes

  3. Referee: [§5] §5 (experimental results): the abstract states strong standard-metric performance, yet the main text does not clarify whether those metrics were computed on the same multi-category MUCCA images used for the PrACo++ tests. This missing cross-protocol comparison weakens the claim that standard metrics are insufficient.

    Authors: We apologize for the ambiguity. The strong standard-metric results cited in the abstract were obtained on existing single-category benchmarks such as FSC-147. In the revised §5 we will explicitly compute and report standard counting metrics on the MUCCA images themselves, enabling a direct side-by-side comparison with the PrACo++ protocol results on identical data. This will make the insufficiency of standard metrics on multi-category scenes unambiguous. revision: yes

Circularity Check

0 steps flagged

No circularity: new empirical protocols and dataset are independent contributions

full rationale

The paper introduces the PrACo++ protocols (negative-label and distractor tests) and the MUCCA dataset as novel evaluation tools to measure semantic grounding failures in CAC models. These are not derived from prior fitted parameters or self-referential definitions; they are presented as new test suites with specialized metrics. The central claim of model weaknesses is supported by direct experimental results on the newly introduced benchmarks rather than by any construction that builds the conclusion into its own inputs. No equations, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own fitted values or self-citations. The evaluation stands on the new dataset and protocols alongside existing external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmarking study that relies on standard computer-vision evaluation assumptions rather than new mathematical derivations or postulated entities.

axioms (1)
  • domain assumption: Standard assumptions about image annotation quality and metric computation in computer vision benchmarks
    Invoked when defining counting errors and the new specialized metrics for the negative-label and distractor tests

pith-pipeline@v0.9.0 · 5611 in / 1115 out tokens · 33638 ms · 2026-05-14T20:49:55.255594+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 45 canonical work pages · 1 internal anchor

  1. [1]

    Counting vehicles with deep learning in onboard uav imagery

    Amato, G., Ciampi, L., Falchi, F., Gennaro, C., 2019. Counting vehicles with deep learning in onboard uav imagery, in: 2019 IEEE Symposium on Computers and Communications (ISCC), pp. 1–6. doi:10.1109/ISCC47284.2019.8969620

  2. [2]

    Open-world text-specified object counting

    Amini-Naieni, N., Amini-Naieni, K., Han, T., Zisserman, A., 2023. Open-world text-specified object counting, in: 34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-23, 2023

  3. [3]

    CountGD: Multi-modal open-world counting

    Amini-Naieni, N., Han, T., Zisserman, A., 2024. CountGD: Multi-modal open-world counting. CoRR abs/2407.04619. doi:10.48550/ARXIV.2407.04619, arXiv:2407.04619

  4. [4]

    Counting in the wild

    Arteta, C., Lempitsky, V.S., Zisserman, A., 2016. Counting in the wild, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, Springer. pp. 483–498. doi:10.1007/978-3-319-46478-7_30

  5. [5]

    An embedded toolset for human activity monitoring in critical environments

    Benedetto, M.D., Carrara, F., Ciampi, L., Falchi, F., Gennaro, C., Amato, G., 2022. An embedded toolset for human activity monitoring in critical environments. Expert Syst. Appl. 199, 117125. doi:10.1016/J.ESWA.2022.117125

  6. [6]

    Insect counting through deep learning-based density maps estimation

    Bereciartua-Pérez, A., Gómez, L., Picón, A., Navarra-Mestre, R., Klukas, C., Eggers, T., 2022. Insect counting through deep learning-based density maps estimation. Computers and Electronics in Agriculture 197, 106933. doi:10.1016/j.compag.2022.106933

  7. [7]

    A survey on class-agnostic counting: Advancements from reference-based to open-world text-guided approaches

    Ciampi, L., Azmoudeh, A., Akbaba, E.E., Saritas, E., Yazici, Z.A., Ekenel, H.K., Amato, G., Falchi, F., 2026a. A survey on class-agnostic counting: Advancements from reference-based to open-world text-guided approaches. Comput. Vis. Image Underst. 267, 104703. URL: https://doi.org/10.1016/j.cviu.2026.104703, doi:10.1016/J.CVIU.2026.104703

  8. [8]

    Learning to count biological structures with raters' uncertainty

    Ciampi, L., Carrara, F., Totaro, V., Mazziotti, R., Lupori, L., Santiago, C., Amato, G., Pizzorusso, T., Gennaro, C., 2022a. Learning to count biological structures with raters' uncertainty. Medical Image Analysis 80, 102500. doi:10.1016/j.media.2022.102500

  9. [9]

    Multi-camera vehicle counting using edge-ai

    Ciampi, L., Gennaro, C., Carrara, F., Falchi, F., Vairo, C., Amato, G., 2022b. Multi-camera vehicle counting using edge-ai. Expert Syst. Appl. 207, 117929. doi:10.1016/J.ESWA.2022.117929

  10. [10]

    Ciampi, L., Messina, N., Pierucci, M., Amato, G., Avvenuti, M., Falchi, F., 2025. Mind the prompt: A novel benchmark for prompt-based class-agnostic counting, in: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025, Tucson, AZ, USA, February 26 - March 6, 2025, IEEE. pp. 7970–7979. doi:10.1109/WACV61041.2025.00774

  11. [11]

    Mucca (multi-category class-agnostic counting) dataset: A collection of multi-category images for class-agnostic object counting

    Ciampi, L., Pacini, G., Messina, N., Amato, G., Falchi, F., 2026b. Mucca (multi-category class-agnostic counting) dataset: A collection of multi-category images for class-agnostic object counting. URL: https://doi.org/10.5281/zenodo.19231375, doi:10.5281/zenodo.19231375

  12. [12]

    Night and Day Instance Segmented Park (NDISPark) Dataset: a Collection of Images taken by Day and by Night for Vehicle Detection, Segmentation and Counting in Parking Areas

    Ciampi, L., Santiago, C., Costeira, J., Gennaro, C., Amato, G., 2022c. Night and Day Instance Segmented Park (NDISPark) Dataset: a Collection of Images taken by Day and by Night for Vehicle Detection, Segmentation and Counting in Parking Areas. doi:10.5281/zenodo.6560823

  13. [13]

    Ciampi, L., Santiago, C., Costeira, J.P., Gennaro, C., Amato, G., 2021. Domain adaptation for traffic density estimation, in: Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021, Volume 5: VISAPP, Online Streaming, February 8-10, 2021, SCITEPRESS. pp. 185–195

  14. [14]

    A deep learning-based pipeline for whitefly pest abundance estimation on chromotropic sticky traps

    Ciampi, L., Zeni, V., Incrocci, L., Canale, A., Benelli, G., Falchi, F., Amato, G., Chessa, S., 2023a. A deep learning-based pipeline for whitefly pest abundance estimation on chromotropic sticky traps. Ecol. Informatics 78, 102384. doi:10.1016/J.ECOINF.2023.102384

  15. [15]

    Pest sticky traps: a dataset for whitefly pest population density estimation in chromotropic sticky traps

    Ciampi, L., Zeni, V., Incrocci, L., Canale, A., Benelli, G., Falchi, F., Amato, G., Chessa, S., 2023b. Pest sticky traps: a dataset for whitefly pest population density estimation in chromotropic sticky traps. doi:10.5281/zenodo.7801239

  16. [16]

    Count-ception: Counting by fully convolutional redundant counting

    Cohen, J.P., Boucher, G., Glastonbury, C.A., Lo, H.Z., Bengio, Y., 2017. Count-ception: Counting by fully convolutional redundant counting, in: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 18–26. doi:10.1109/ICCVW.2017.9

  18. [19]

    Referring Expression Counting

    Dai, S., Liu, J., Cheung, N.M., 2024b. Referring Expression Counting, in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 16985–16995. doi:10.1109/CVPR52733.2024.01607

  19. [20]

    Afreeca: Annotation-free counting for all

    D'Alessandro, A.C., Mahdavi-Amiri, A., Hamarneh, G., 2024. Afreeca: Annotation-free counting for all, in: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part IV, Springer. pp. 75–91. doi:10.1007/978-3-031-73235-5_5

  20. [21]

    Semantic generative augmentations for few-shot counting

    Doubinsky, P., Audebert, N., Crucianu, M., Borgne, H.L., 2024. Semantic generative augmentations for few-shot counting, in: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024, IEEE. pp. 5431–5440. doi:10.1109/WACV57701.2024.00536

  21. [22]

    A low-shot object counting network with iterative prototype adaptation

    Dukic, N., Lukezic, A., Zavrtanik, V., Kristan, M., 2023. A low-shot object counting network with iterative prototype adaptation, in: ICCV, Paris, France, October 1-6, 2023, IEEE. pp. 18826–18835. doi:10.1109/ICCV51070.2023.01730

  22. [23]

    Adaptive and background-aware match for class-agnostic counting

    Gong, S., Yang, J., Zhang, S., 2025. Adaptive and background-aware match for class-agnostic counting. IEEE Signal Process. Lett. 32, 1261–1265. URL: https://doi.org/10.1109/LSP.2025.3546891, doi:10.1109/LSP.2025.3546891

  23. [24]

    Guerrero-Gómez-Olmedo, R., Torre-Jiménez, B., López-Sastre, R.J., Maldonado-Bascón, S., Oñoro-Rubio, D., 2015. Extremely overlapping vehicle counting, in: Pattern Recognition and Image Analysis - 7th Iberian Conference, IbPRIA 2015, Santiago de Compostela, Spain, June 17-19, 2015, Proceedings, Springer. pp. 423–431. doi:10.1007/978-3-319-19390-8_48

  24. [25]

    Learning to count anything: Reference-less class-agnostic counting with weak supervision

    Hobley, M.A., Prisacariu, V., 2022. Learning to count anything: Reference-less class-agnostic counting with weak supervision. CoRR abs/2205.10203. doi:10.48550/ARXIV.2205.10203, arXiv:2205.10203

  25. [26]

    Hobley, M.A., Prisacariu, V., 2024. ABC easy as 123: A blind counter for exemplar-free multi-class class-agnostic counting, in: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XI, Springer. pp. 304–319. doi:10.1007/978-3-031-73247-8_18

  26. [27]

    Crowd counting using scale-aware attention networks

    Hossain, M.A., Hosseinzadeh, M., Chanda, O., Wang, Y., 2019. Crowd counting using scale-aware attention networks, in: IEEE Winter Conference on Applications of Computer Vision, WACV 2019, Waikoloa Village, HI, USA, January 7-11, 2019, IEEE. pp. 1280–. doi:10.1109/WACV.2019.00141

  28. [29]

    Huang, Z., Dai, M., Zhang, Y., Zhang, J., Shan, H., 2024. Point, segment and count: A generalized framework for object counting, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, IEEE. pp. 17067–17076. doi:10.1109/CVPR52733.2024.01615

  29. [30]

    Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Máadeed, S., Rajpoot, N.M., Shah, M., 2018. Composition loss for counting, density map estimation and localization in dense crowds, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, Springer. pp. 544–559. doi:10.1007/978-3-030-01216-8_33

  30. [31]

    Jiang, R., Liu, L., Chen, C., 2023. Clip-count: Towards text-guided zero-shot object counting, in: Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023 - 3 November 2023, ACM. pp. 4535–4545. URL: https://doi.org/10.1145/3581783.3611789, doi:10.1145/3581783.3611789

  31. [32]

    Kang, S., Moon, W., Kim, E., Heo, J., 2024. Vlcounter: Text-aware visual representation for zero-shot object counting, in: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EA...

  32. [33]

    Revisiting crowd counting: State-of-the-art, trends, and future perspectives

    Khan, M.A., Menouar, H., Hamila, R., 2023. Revisiting crowd counting: State-of-the-art, trends, and future perspectives. Image Vis. Comput. 129, 104597. doi:10.1016/J.IMAVIS.2022.104597

  33. [34]

    Segment anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R. Segment anything. arXiv:2304.02643

  35. [36]

    Lempitsky, V.S., Zisserman, A., 2010. Learning to count objects in images, in: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, Curran Associates, Inc. pp. 1324–1332

  36. [37]

    Microsoft COCO: common objects in context

    Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: common objects in context, in: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Springer. pp. 740–755. doi:10.1007/978-3-319-10602-1_48

  37. [38]

    Lin, W., Chan, A.B., 2024. A fixed-point approach to unified prompt-based counting, in: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver,...

  38. [39]

    Countr: Transformer-based generalised visual counting

    Liu, C., Zhong, Y., Zisserman, A., Xie, W., 2022. Countr: Transformer-based generalised visual counting, in: 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022, BMVA Press. p. 370

  39. [40]

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L., 2024. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection, in: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVII, Springer. pp. 38–55. doi:10.1007/978-3-031-7...

  40. [41]

    Context-aware crowd counting

    Liu, W., Salzmann, M., Fua, P., 2019. Context-aware crowd counting, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE. pp. 5099–5108. doi:10.1109/CVPR.2019.00524

  41. [42]

    Mondal, A., Nag, S., Zhu, X., Dutta, A., 2025. Omnicount: Multi-label object counting with semantic-geometric priors, in: AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, AAAI Press. pp. 19537–19545. doi:10.1609/AAAI.V39I18.34151

  42. [43]

    Norouzzadeh, M.S., Nguyen, A., Kosmala, M., Swanson, A., Palmer, M.S., Packer, C., Clune, J., 2018. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. USA 115, E5716–E5725. doi:10.1073/PNAS.1719367115

  43. [44]

    Dave - a detect-and-verify paradigm for low-shot counting

    Pelhan, J., Lukežič, A., Zavrtanik, V., Kristan, M., 2024. Dave - a detect-and-verify paradigm for low-shot counting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23293–23302

  44. [45]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, PMLR. pp. 8748–8763

  45. [46]

    Learning to count everything

    Ranjan, V., Sharma, U., Nguyen, T., Hoai, M., 2021. Learning to count everything, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, Computer Vision Foundation / IEEE. pp. 3394–3403. doi:10.1109/CVPR46437.2021.00340

  46. [47]

    A computer vision based vehicle detection and counting system

    Seenouvong, N., Watchareeruetai, U., Nuthong, C., Khongsomboon, K., Ohnishi, N., 2016. A computer vision based vehicle detection and counting system, in: 8th International Conference on Knowledge and Smart Technology, KST 2016, Chiangmai, Thailand, February 3-6, 2016, IEEE. pp. 224–227. doi:10.1109/KST.2016.7440510

  47. [48]

    Training-free object counting with prompts

    Shi, Z., Sun, Y., Zhang, M., 2024. Training-free object counting with prompts, in: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE Computer Society, Los Alamitos, CA, USA. pp. 322–330. doi:10.1109/WACV57701.2024.00039

  48. [49]

    A neighbor-aware feature enhancement network for crowd counting

    Wang, L., Li, J., Qi, C., Wu, X., Zou, R., Wang, F., Wang, P., 2025. A neighbor-aware feature enhancement network for crowd counting. Image Vis. Comput. 159, 105578. URL: https://doi.org/10.1016/j.imavis.2025.105578, doi:10.1016/J.IMAVIS.2025.105578

  49. [50]

    Wang, Z., Xiao, L., Cao, Z., Lu, H., 2024. Vision transformer off-the-shelf: A surprising baseline for few-shot class-agnostic counting, in: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artifi...

  50. [51]

    Sqlnet: Scale-modulated query and localization network for few-shot class-agnostic counting

    Wu, H., Chen, Y., Liu, L., Chen, T., Wang, K., Lin, L., 2025. Sqlnet: Scale-modulated query and localization network for few-shot class-agnostic counting. IEEE Trans. Image Process. 34, 4631–. URL: https://doi.org/10.1109/TIP.2025.3588255, doi:10.1109/TIP.2025.3588255

  52. [53]

    Microscopy cell counting and detection with fully convolutional regression networks

    Xie, W., Noble, J.A., Zisserman, A., 2018. Microscopy cell counting and detection with fully convolutional regression networks. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 6, 283–292. doi:10.1080/21681163.2016.1149104

  53. [54]

    Zero-shot object counting

    Xu, J., Le, H., Nguyen, V., Ranjan, V., Samaras, D., 2023. Zero-shot object counting, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, IEEE. pp. 15548–15557. doi:10.1109/CVPR52729.2023.01492

  54. [55]

    CREAM: few-shot object counting with cross refinement and adaptive density map

    Xu, Y., Li, M., Ye, Q., Wang, S., Li, L., Zhang, H., 2025. CREAM: few-shot object counting with cross refinement and adaptive density map. Image Vis. Comput. 161, 105632. URL: https://doi.org/10.1016/j.imavis.2025.105632, doi:10.1016/J.IMAVIS.2025.105632

  55. [56]

    Learning spatial similarity distribution for few-shot object counting

    Xu, Y., Song, F., Zhang, H., 2024. Learning spatial similarity distribution for few-shot object counting. CoRR abs/2405.11770. doi:10.48550/ARXIV.2405.11770, arXiv:2405.11770

  56. [57]

    Zhang, S., Wu, G., Costeira, J.P., Moura, J.M.F., 2017. Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, IEEE Computer Society. pp. 3687–3696. doi:10.1109/ICCV.2017.396

  57. [58]

    Cfenet: Context-aware feature enhancement network for efficient few-shot object counting

    Zhang, S., Zhai, G., Chen, K., Wang, H., Han, S., 2025. Cfenet: Context-aware feature enhancement network for efficient few-shot object counting. Image Vis. Comput. 154, 105383. URL: https://doi.org/10.1016/j.imavis.2024.105383, doi:10.1016/J.IMAVIS.2024.105383

  58. [59]

    Enhanced crowd counting with weighted attention network and multi-scale feature integration

    Zhou, L., Hu, Z., 2025. Enhanced crowd counting with weighted attention network and multi-scale feature integration. Image Vis. Comput. 163, 105750. URL: https://doi.org/10.1016/j.imavis.2025.105750, doi:10.1016/J.IMAVIS.2025.105750

  59. [60]

    Multi-branch progressive embedding network for crowd counting

    Zhou, L., Rao, S., Li, W., Hu, B., Sun, B., 2024. Multi-branch progressive embedding network for crowd counting. Image Vis. Comput. 148, 105140. URL: https://doi.org/10.1016/j.imavis.2024.105140, doi:10.1016/J.IMAVIS.2024.105140