Recognition: no theorem link
RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
RS-OVC counts novel object classes in remote-sensing images using only text or visual prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RS-OVC is the first open-vocabulary counting model designed specifically for remote-sensing data. It demonstrates the ability to accurately count object classes that were not encountered during training, relying exclusively on conditioning from text prompts or visual examples.
What carries the argument
The RS-OVC model architecture, which supports open-vocabulary conditioning to enable counting of arbitrary classes in aerial imagery.
If this is right
- It removes the requirement for costly re-annotation when new object types need counting.
- The approach supports dynamic applications in environmental monitoring and urban planning.
- Both text-based and image-based prompts can be used interchangeably for conditioning the count.
- Performance holds for novel classes without additional training steps.
Where Pith is reading between the lines
- This opens the door to integrating such models into automated satellite analysis pipelines for ongoing surveillance.
- Similar techniques might apply to counting in other specialized imagery domains like medical or industrial inspection.
- Future work could test robustness across different sensor types or resolutions in remote sensing.
Load-bearing premise
The assumption that text or visual conditioning provides sufficient information for a model to count previously unseen object classes accurately in remote-sensing scenes.
What would settle it
Evaluating the model on a held-out dataset of remote-sensing images containing a new object class such as solar panels, and checking if the predicted counts match manual tallies within acceptable error margins.
Figures
read the original abstract
Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RS-OVC as the first open-vocabulary counting (OVC) model for remote-sensing and aerial imagery. It claims that this model can perform accurate counting of novel object classes unseen during training, relying solely on textual and/or visual conditioning to overcome the limitations of closed-set counting methods that require re-annotation and retraining.
Significance. Should the approach prove effective, it would offer a substantial advance in remote-sensing object counting by facilitating adaptation to new classes without retraining, which is particularly valuable for dynamic real-world monitoring applications. The concept directly targets a practical limitation in current RS counting techniques.
major comments (1)
- [Abstract] The manuscript consists solely of an abstract. The central claim that the model achieves 'accurate counting of novel object classes' via text/visual conditioning is presented without any methodological description, architecture details, training procedure, quantitative results, ablation studies, or validation on novel classes, making the claim impossible to evaluate or verify.
Simulated Author's Rebuttal
We thank the referee for their review and for recognizing the potential significance of open-vocabulary counting for remote-sensing applications. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] The manuscript consists solely of an abstract. The central claim that the model achieves 'accurate counting of novel object classes' via text/visual conditioning is presented without any methodological description, architecture details, training procedure, quantitative results, ablation studies, or validation on novel classes, making the claim impossible to evaluate or verify.
Authors: We agree that the version provided for review contains only the abstract and therefore lacks the requested details, making independent evaluation impossible at this stage. The full manuscript (arXiv:2604.08704) includes dedicated sections on the RS-OVC architecture (a vision-language backbone adapted with a counting head), the training procedure (closed-set pre-training followed by open-vocabulary fine-tuning with text and visual prompts), quantitative results on RS datasets with held-out novel classes, ablation studies on conditioning modalities, and explicit validation experiments measuring counting accuracy for unseen object categories. We will submit a revised manuscript that incorporates these sections in full so that the claims can be properly assessed. revision: yes
Circularity Check
No significant circularity detected
full rationale
The abstract and context contain no equations, derivations, fitted parameters, or load-bearing self-citations. The central claim is a proposal of a new model (RS-OVC) for open-vocabulary counting based on text/visual conditioning. No step reduces by construction to its inputs, renames a known result, or relies on an unverified self-citation chain. The derivation chain, if present in the full manuscript, cannot be examined for circularity from the given text, but nothing in the provided material exhibits the enumerated patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
NeurIPS37, 48810–48837 (2024)
Amini-Naieni, N., Han, T., Zisserman, A.: Countgd: Multi-modal open-world counting. NeurIPS37, 48810–48837 (2024)
2024
-
[2]
Angeoletto, F., Fellowes, M.D., Santos, J.W.: Counting brazil’s urban trees will help make brazil’s urban trees count. J. For.116(5), 489–490 (2018)
2018
-
[3]
In: ECCV
Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: ECCV. pp. 483–498. Springer (2016)
2016
-
[4]
Barrington,L.,Ghosh,S.,Greene,M.,Har-Noy,S.,Berger,J.,Gill,S.,Lin,A.Y.M., Huyck, C.: Crowdsourcing earthquake damage assessment using remote sensing imagery. Ann. Geophys.54(6) (2011)
2011
-
[5]
Remote Sens
Bernd, A., Braun, D., Ortmann, A., Ulloa-Torrealba, Y.Z., Wohlfart, C., Bell, A.: More than counting pixels–perspectives on the importance of remote sensing training in ecology and conservation. Remote Sens. Ecol. Conserv.3(1), 38–47 (2017)
2017
-
[6]
Remote Sens
Chen, G., Shang, Y.: Transformer for tree counting in aerial images. Remote Sens. 14(3), 476 (2022)
2022
-
[7]
Geo-spatial Inf
Chong, K.L., Kanniah, K.D., Pohl, C., Tan, K.P.: A review of remote sensing applications for oil palm studies. Geo-spatial Inf. Sci.20(2), 184–200 (2017)
2017
-
[8]
Australas
Dare, P., Fraser, C., Duthie, T.: Application of automated remote sensing tech- niques to dam counting. Australas. J. Water Resour.5(2), 195–208 (2002)
2002
-
[9]
In: NAACL
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. In: NAACL. pp. 4171–4186 (2019) 14 T. Shor et al
2019
-
[10]
IEEE TGRS 60, 1–11 (2022)
Ding, G., Cui, M., Yang, D., Wang, T., Wang, S., Zhang, Y.: Object counting for remote-sensing images via adaptive density map-assisted learning. IEEE TGRS 60, 1–11 (2022)
2022
-
[11]
IEEE TGRS60, 1–12 (2021)
Duan, Z., Wang, S., Di, H., Deng, J.: Distillation remote sensing object counting via multi-scale context feature aggregation. IEEE TGRS60, 1–12 (2021)
2021
-
[12]
IJCV88(2), 303–338 (2010)
Everingham,M.,VanGool,L.,Williams,C.K.,Winn,J.,Zisserman,A.:Thepascal visual object classes (voc) challenge. IJCV88(2), 303–338 (2010)
2010
-
[13]
Fan, Y., Wen, Q., Wang, W., Wang, P., Li, L., Zhang, P.: Quantifying disaster physical damage using remote sensing data—a technical work flow and case study of the 2014 ludian earthquake in china. Int. J. Disaster Risk Sci.8(4), 471–488 (2017)
2014
-
[14]
Farjon, G., Huijun, L., Edan, Y.: Deep-learning-based counting methods, datasets, and applications in agriculture: A review. Precis. Agric.24(5), 1683–1711 (2023)
2023
-
[15]
IEEE TGRS59(5), 3642–3655 (2020)
Gao, G., Liu, Q., Wang, Y.: Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method. IEEE TGRS59(5), 3642–3655 (2020)
2020
-
[16]
IEEE TGRS62, 1–14 (2024)
Gao, J., Zhao, L., Li, X.: Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images. IEEE TGRS62, 1–14 (2024)
2024
-
[17]
Garrido-Valenzuela, F., Cats, O., van Cranenburgh, S.: Where are the people? counting people in millions of street-level images to explore associations between people’s urban density and urban characteristics. Comput. Environ. Urban Syst. 102, 101971 (2023)
2023
-
[18]
IEEE TGRS62, 1–13 (2024)
Guo, H., Gao, J., Yuan, Y.: Balanced density regression network for remote sensing object counting. IEEE TGRS62, 1–13 (2024)
2024
-
[19]
Remote Sens.14(24), 6363 (2022)
Guo, X., Anisetti, M., Gao, M., Jeon, G.: Object counting in remote sensing via triple attention and scale-aware network. Remote Sens.14(24), 6363 (2022)
2022
-
[20]
Guo, Y., Wu, C., Du, B., Zhang, L.: Density map-based vehicle counting in remote sensing images with limited resolution. ISPRS J. Photogramm. Remote Sens.189, 201–217 (2022)
2022
-
[21]
Methods Ecol
Hollings, T., Burgman, M., van Andel, M., Gilbert, M., Robinson, T., Robinson, A.: How do you find the green sheep? a critical review of the use of remotely sensed imagery to detect and count animals. Methods Ecol. Evol.9(4), 881–892 (2018)
2018
-
[22]
ISPRS Int
Kızılkaya, S., Alganci, U., Sertel, E.: Vhrships: An extensive benchmark dataset for scalable deep learning-based ship detection applications. ISPRS Int. J. Geo-Inf. 11(8), 445 (2022)
2022
-
[23]
John Wiley & Sons (2009)
Klemelä, J.S.: Smoothing of multivariate data: density estimation and visualiza- tion. John Wiley & Sons (2009)
2009
-
[24]
IEEE TGRS61, 1–11 (2023)
Li, C., Cheng, G., Wang, G., Zhou, P., Han, J.: Instance-aware distillation for efficient object detection in remote sensing images. IEEE TGRS61, 1–11 (2023)
2023
-
[25]
Li, K., Wan, G., Cheng, G., Meng, L., Han, J.: Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens.159, 296–307 (2020)
2020
-
[26]
In: ECCV
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)
2014
-
[27]
arXiv preprint arXiv:2501.06697 (2025)
Liu, P., Lei, S., Li, H.C.: Mamba-moc: A multicategory remote object counting via state space model. arXiv preprint arXiv:2501.06697 (2025)
-
[28]
In: ECCV
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: ECCV. pp. 38–55. Springer (2024) RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data 15
2024
-
[29]
In: ICCV
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV. pp. 10012–10022 (2021)
2021
-
[30]
Michelat, T., Hueber, N., Raymond, P., Pichler, A., Schaal, P., Dugaret, B.: Auto- matic pedestrian detection and counting applied to urban planning. In: AmI. pp. 285–289. Springer (2010)
2010
-
[31]
In: AAAI
Pan, J., Liu, Y., Fu, Y., Ma, M., Li, J., Paudel, D.P., Van Gool, L., Huang, X.: Locate anything on earth: Advancing open-vocabulary object detection for remote sensing community. In: AAAI. vol. 39, pp. 6281–6289 (2025)
2025
-
[32]
Park, J.J., Park, K.A., Kim, T.S., Oh, S., Lee, M.: Aerial hyperspectral remote sensing detection for maritime search and surveillance of floating small objects. Adv. Space Res.72(6), 2118–2136 (2023)
2023
-
[33]
Remote Sens.16(3), 557 (2024)
Reggiannini, M., Salerno, E., Bacciu, C., D’Errico, A., Lo Duca, A., Marchetti, A., Martinelli, M., Mercurio, C., Mistretta, A., Righi, M., et al.: Remote sensing for maritime traffic understanding. Remote Sens.16(3), 557 (2024)
2024
-
[34]
IEEE TPAMI39(6), 1137–1149 (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de- tection with region proposal networks. IEEE TPAMI39(6), 1137–1149 (2016)
2016
-
[35]
Saleh, S.A.M., Suandi, S.A., Ibrahim, H.: Recent survey on crowd density esti- mation and counting for visual surveillance. Eng. Appl. Artif. Intell.41, 103–114 (2015)
2015
-
[36]
Ecosystems25(8), 1719–1737 (2022)
Senf, C.: Seeing the system from above: The use and potential of remote sensing for studying ecosystem dynamics. Ecosystems25(8), 1719–1737 (2022)
2022
-
[37]
IEEE TCSVT (2024)
Shen, Z., Li, G., Xia, R., Meng, H., Huang, Z.: A lightweight object counting network based on density map knowledge distillation. IEEE TCSVT (2024)
2024
-
[38]
Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
In: RSETE
Song, H., Liu, X., Zhang, X., Hu, J.: Real-time monitoring for crowd counting using video surveillance and gis. In: RSETE. pp. 1–4. IEEE (2012)
2012
-
[40]
Sun, X., Wang, P., Yan, Z., Xu, F., Wang, R., Diao, W., Chen, J., Li, J., Feng, Y., Xu, T., et al.: Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 184, 116–130 (2022)
2022
-
[41]
Remote Sens.17, 3385 (2025)
Wang, S., Song, Y., Xiang, J., Chen, Y., Zhong, P., Fu, R.: Mask-guided teacher– student learning for open-vocabulary object detection in remote sensing images. Remote Sens.17, 3385 (2025)
2025
-
[42]
In: CVPR
Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: Dota: A large-scale dataset for object detection in aerial images. In: CVPR. pp. 3974–3983 (2018)
2018
-
[43]
In: CVPR
Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: CVPR. pp. 589–597 (2016) 16 T. Shor et al. A Experimental Setup & Implementation Details A.1 RS-OVC Implementation Our implementation of RS-OVC mostly follows the official code published in the CountGD paper, which is the backbone...
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.