RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

Hao-Yuan Ma; Jie Gao; Li Zhang; Zhiwei Zhu

arxiv: 2606.17561 · v1 · pith:QRXZNQSJnew · submitted 2026-06-16 · 💻 cs.CV

Pith reviewed 2026-06-27 01:27 UTC · model grok-4.3

keywords: object counting · open-vocabulary counting · text-guided counting · real-time inference · vision-language model · hybrid attention · visual prototype · transformer layer

The pith

RT-Counter projects visual features into text space and weaves local-global attention to count text-described objects at real-time speeds with competitive accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework for text-guided open-vocabulary object counting that aims to deliver both strong accuracy and real-time performance. It introduces a Visual Prototype Textualization module to move visual features into text feature space, creating representations that combine abstract information that visual prototypes capture poorly with detailed information that is hard to express in text. It pairs this with Weaving Transformer layers that apply a hybrid attention mechanism to combine local and global visual features efficiently. Experiments on standard datasets demonstrate that the approach reaches a mean absolute error of 13.30 on FSC147 while running at 112.48 frames per second, exceeding the speed and parameter efficiency of prior leading methods. A sympathetic reader would care because previous text-guided counting systems forced a choice between accuracy and the speed needed for live applications.

Core claim

The paper claims that two components together enable accurate text-guided open-vocabulary object counting at real-time speeds. The first is the Visual Prototype Textualization module, which projects learned visual features into text feature space to produce features containing both abstract information that is hard to capture with visual prototypes and detailed prototype information that is difficult to describe in text. The second is the Weaving Transformer layer, whose novel hybrid attention mechanism weaves local and global visual features. The evidence offered is a competitive MAE of 13.30 on FSC147 at 112.48 FPS, using over 4 times fewer parameters than leading prior methods.

What carries the argument

Visual Prototype Textualization (VPT) module that projects visual features into text space and Weaving Transformer (Weaformer) layers with hybrid attention that efficiently combine local and global features.
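The abstract does not give the Weaformer equations, so the following is only a generic sketch of what mixing local and global attention can look like on a short 1-D token sequence. The function names, the window size, and the mixing weight `alpha` are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def hybrid_attention(tokens, window=1, alpha=0.5):
    """Blend windowed (local) and full (global) self-attention per token.

    alpha weights the local branch. Both window and alpha are illustrative
    assumptions, not values reported for RT-Counter.
    """
    n = len(tokens)
    out = []
    for i, q in enumerate(tokens):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local = attend(q, tokens[lo:hi], tokens[lo:hi])   # neighbours only
        global_ = attend(q, tokens, tokens)               # every token
        out.append([alpha * l + (1 - alpha) * g for l, g in zip(local, global_)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mixed = hybrid_attention(tokens, window=1, alpha=0.5)
print(len(mixed), len(mixed[0]))
```

This sketch computes the global branch at full cost, so it shows only the mixing idea and does not reproduce whatever savings the Weaformer layer achieves.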

If this is right

Text-specified counting becomes feasible in live video streams without requiring category-specific retraining.
The reduced parameter count allows deployment on devices with limited memory while retaining open-vocabulary flexibility.
Hybrid attention layers of this form can replace heavier transformer blocks in other vision-language counting pipelines.
Results across three public datasets indicate the design generalizes beyond the primary FSC147 benchmark.
Real-time performance removes the previous need to trade accuracy for speed in text-guided counting scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same visual-to-text projection step could be tested in related tasks such as open-vocabulary detection or segmentation to see whether the accuracy-speed balance transfers.
If the hybrid attention proves stable, it offers a template for lowering compute in other vision transformers that mix local and global cues.
Evaluating the method on text descriptions that contain negation or spatial relations not present in current benchmarks would expose any untested limits of the textualization step.
Combining the framework with larger frozen vision-language backbones might increase accuracy further while preserving the reported speed advantage.

Load-bearing premise

The projection of visual features into text space by the VPT module actually succeeds in enhancing object-level counting by supplying both abstract and detailed information.
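One way to picture this premise is a learned linear map from visual space to text space, followed by a similarity check against a text embedding. The sketch below assumes exactly that and nothing more: the weight matrix, the dimensions, and the toy vectors are all hypothetical, and the paper's actual VPT design is not described in the abstract.

```python
import math

def project(vec, weight):
    """Apply a linear map; weight has shape (text_dim, visual_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy values chosen for illustration only (visual_dim = 3, text_dim = 2).
visual_prototype = [0.2, 0.9, 0.1]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
text_embedding = [0.2, 1.0]

textualized = project(visual_prototype, weight)
similarity = cosine(textualized, text_embedding)
print(textualized, similarity)
```

If the premise holds, projected prototypes of the target category should sit closer to the matching text embedding than prototypes of other categories do, which is the kind of check an ablation could report.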

What would settle it

Measuring the model on the FSC147 test set and obtaining a mean absolute error above 13.30 or a frame rate below 100 FPS on comparable hardware would falsify the performance claims.
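A minimal sketch of such a check is given here, assuming a hypothetical `infer` callable that maps one frame to a predicted count. The thresholds are the two figures named above; the harness itself is an illustration, not the authors' evaluation script.

```python
import time

def mean_absolute_error(predicted, actual):
    """MAE between predicted and ground-truth counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

def measure_fps(infer, frames, warmup=2):
    """Frames per second of `infer` over `frames`, after a short warm-up."""
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return float("inf") if elapsed == 0 else len(frames) / elapsed

def claim_holds(mae, fps, mae_limit=13.30, fps_floor=100.0):
    """True if neither falsification condition from the text is met."""
    return mae <= mae_limit and fps >= fps_floor

# Toy counts, not results from the paper.
toy_mae = mean_absolute_error([10, 22, 31], [12, 20, 30])
toy_fps = measure_fps(lambda frame: frame, list(range(100)))
print(toy_mae, claim_holds(13.30, 112.48))
```

FPS depends heavily on hardware, batch size, and input resolution, so a fair replication would fix those to the paper's reported setup before comparing against the 100 FPS floor.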

Figures

Figures reproduced from arXiv: 2606.17561 by Hao-Yuan Ma, Jie Gao, Li Zhang, Zhiwei Zhu.

**Figure 2.** Overview of RT-Counter, where snowflake symbols mean that parameters are frozen, and flame symbols denote that parameters

**Figure 3.** Detailed Architectures of VPT and feature enhancer.

**Figure 4.** The counting results on FSC147. These results highlight RT-Counter's robust generalization across different object categories, scales, and environmental conditions. The model successfully handles challenging scenarios including object overlapping, background clutter, and various lighting conditions.

read the original abstract

Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model's counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4x faster and over 4x more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RT-Counter adds VPT and Weaformer to reach real-time FPS on text-guided counting while matching prior MAE numbers, with code released to support the empirical claims.

The key thing here is a practical system that gets text-guided open-vocabulary counting to run at 112 FPS with an MAE of 13.30 on FSC147, which is 7.4 times faster and over 4 times smaller than the leading prior methods. The two new pieces are the Visual Prototype Textualization module that maps visual features into text space and the Weaformer layers that mix local and global attention at lower cost.
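To make the speed figures concrete, the short calculation below derives the baseline throughput and per-frame latencies they imply. It assumes "7.4 times faster" is a ratio of FPS against a single prior method on the same hardware, which the abstract does not state explicitly.

```python
reported_fps = 112.48   # from the abstract
speedup = 7.4           # from the abstract; assumed to be an FPS ratio

implied_baseline_fps = reported_fps / speedup
latency_ms = 1000.0 / reported_fps
baseline_latency_ms = 1000.0 / implied_baseline_fps

print(f"implied baseline: {implied_baseline_fps:.1f} FPS")
print(f"per-frame latency: {latency_ms:.2f} ms vs {baseline_latency_ms:.2f} ms")
```

Under that assumption the comparison method runs at about 15.2 FPS, which is roughly 65.8 ms per frame against roughly 8.9 ms for RT-Counter.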

The work does a few things right. It targets a clear engineering gap between accuracy and speed in TOOC, runs experiments on three public datasets, and ships the code on GitHub. That last part matters because it turns the speed and parameter claims into something others can check instead of taking on faith. The hybrid attention idea in Weaformer is a straightforward way to keep descriptive power without full quadratic cost, and the overall framing stays focused on the real-time requirement.

The soft spots are mostly about missing detail rather than outright problems. The abstract gives the headline numbers but does not include the ablation tables or exact equations for how VPT projects features, so it is not yet clear how much each module drives the gains versus careful tuning or baseline choices. If the full paper shows those breakdowns and the improvements hold under the same evaluation protocol, the central trade-off claim looks solid. No circular fitting or hidden assumptions jump out from the reported setup.

This is aimed at people building efficient vision-language pipelines for counting in robotics or monitoring. It deserves a serious referee because the task is well scoped, the performance targets are concrete, and the code release gives a path to verification. I would send it for review.

Referee Report

3 major / 2 minor

Summary. The paper proposes RT-Counter, a real-time framework for text-guided open-vocabulary object counting (TOOC). It introduces a Visual Prototype Textualization (VPT) module that projects visual features into text space to capture abstract and detailed prototype information, and Weaving Transformer (Weaformer) layers using a hybrid attention mechanism to weave local and global features efficiently. The central empirical claim is that RT-Counter achieves a competitive MAE of 13.30 on FSC147 while running at 112.48 FPS, making it 7.4x faster and over 4x more parameter-efficient than leading TOOC methods, with public code released.

Significance. If the reported trade-off holds under verification, the work would be significant for practical TOOC applications by demonstrating that real-time inference is achievable without major accuracy loss. The public code repository strengthens reproducibility and allows direct testing of the VPT and Weaformer components.

major comments (3)

[Abstract / §3 (method)] Abstract and method description: The VPT module is asserted to 'project learned visual features into a text feature space' and generate features with abstract/detailed information, but no equations, pseudocode, or feature-dimension details are supplied; this is load-bearing for the claim that it enhances the object-level visual-language model's counting capabilities beyond standard VLMs.
[Abstract / §4 (Weaformer)] Abstract and experiments: The Weaformer is claimed to maintain 'high descriptive power at a fraction of the computational cost' via hybrid attention, yet the abstract provides neither complexity analysis (FLOPs, parameters) nor ablation results comparing it to standard attention or prior TOOC backbones; without these, the 7.4x FPS and 4x parameter-efficiency gains cannot be independently assessed.
[Experiments] Experiments section: Only a single MAE point (13.30) and FPS value (112.48) are stated for FSC147 with no error bars, multiple runs, or dataset-split details; this weakens the cross-method comparison and the assertion that the accuracy-speed trade-off is broken.
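The complexity analysis requested in the second comment could start from a back-of-envelope count like the one sketched here. It counts multiply-accumulates for the two attention matrix products only (scores and weighted sum), ignoring projections and softmax. The token count, channel width, and window size are hypothetical, and the windowed variant is a generic local-attention model rather than the Weaformer itself.

```python
def global_attention_macs(num_tokens, dim):
    """MACs for QK^T plus attention-weighted V with full N x N attention."""
    return 2 * num_tokens * num_tokens * dim

def windowed_attention_macs(num_tokens, dim, window):
    """MACs when each token attends to at most `window` tokens."""
    span = min(window, num_tokens)
    return 2 * num_tokens * span * dim

# Illustrative sizes only; none of these come from the paper.
n, d, w = 1024, 256, 49
global_cost = global_attention_macs(n, d)
local_cost = windowed_attention_macs(n, d, w)
print(global_cost, local_cost, global_cost / local_cost)
```

The ratio reduces to `n / w`, so the saving from restricting attention grows linearly with the number of tokens. A hybrid layer that keeps some global path would land somewhere between the two counts.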

minor comments (2)

[Abstract] The abstract mentions 'three public datasets' but reports quantitative results only for FSC147; the other two should be summarized with at least MAE/FPS numbers in the abstract or a table to support the general claim.
[Method] Notation for VPT and Weaformer components should be introduced with consistent symbols once the full equations appear, to avoid ambiguity when comparing to prior VLMs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our method and experimental reporting. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract / §3 (method)] Abstract and method description: The VPT module is asserted to 'project learned visual features into a text feature space' and generate features with abstract/detailed information, but no equations, pseudocode, or feature-dimension details are supplied; this is load-bearing for the claim that it enhances the object-level visual-language model's counting capabilities beyond standard VLMs.

Authors: We agree that the VPT description would be strengthened by explicit mathematical details. In the revised manuscript we will add the projection equations, feature dimensions, and pseudocode for the VPT module in Section 3. revision: yes
Referee: [Abstract / §4 (Weaformer)] Abstract and experiments: The Weaformer is claimed to maintain 'high descriptive power at a fraction of the computational cost' via hybrid attention, yet the abstract provides neither complexity analysis (FLOPs, parameters) nor ablation results comparing it to standard attention or prior TOOC backbones; without these, the 7.4x FPS and 4x parameter-efficiency gains cannot be independently assessed.

Authors: We will add FLOPs and parameter counts for Weaformer versus baselines, plus ablation results on the hybrid attention mechanism, to the experiments section of the revision. revision: yes
Referee: [Experiments] Experiments section: Only a single MAE point (13.30) and FPS value (112.48) are stated for FSC147 with no error bars, multiple runs, or dataset-split details; this weakens the cross-method comparison and the assertion that the accuracy-speed trade-off is broken.

Authors: The reported values use the standard FSC147 test split. We will include error bars from multiple runs, state the splits explicitly, and expand the comparison table in the revised experiments section. revision: yes
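The error bars promised in the last response are typically a mean and sample standard deviation over independently seeded runs. A minimal sketch follows; the run values are placeholders chosen for easy arithmetic, not measurements from the paper.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) for a list of per-run metrics."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std

# Placeholder MAE values from three hypothetical seeds.
mae_runs = [10.0, 12.0, 14.0]
mae_mean, mae_std = summarize_runs(mae_runs)
print(f"MAE {mae_mean:.2f} ± {mae_std:.2f} over {len(mae_runs)} runs")
```

Reporting FPS the same way, over repeated timing runs on fixed hardware, would let readers judge whether the speed gap to prior methods exceeds run-to-run noise.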

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical contribution that introduces architectural modules (VPT and Weaformer) and validates them via training and evaluation on standard public benchmarks (FSC147 and two others). Reported figures (MAE 13.30, 112.48 FPS) are measured outcomes, not quantities derived from equations or parameters that are themselves fitted to the same target metrics. No derivation chain, uniqueness theorem, or self-citation is invoked to justify the central performance claim; the work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; standard deep-learning assumptions (gradient descent, transformer attention) are implicit but not enumerated.
