pith. machine review for the scientific record.

arxiv: 2604.02773 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords small object detection · point prompts · inference-time prompting · large-scale dataset · label-efficient detection · generalization · TinySet-9M · DEAL framework

The pith

A single point prompt at inference time lets a detector lift strict small-object localization accuracy (AP75) by 31.4 percent relative to fully supervised baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TinySet-9M, a nine-million-scale, multi-domain dataset built to address the scarcity of large-scale data for small object detection. It then defines a new Point-Prompt Small Object Detection paradigm in which a user supplies one sparse point at test time to provide the category semantics that the tiny visual footprint alone cannot. From this paradigm the authors build DEAL, a prompt-conditioned detector trained on the new dataset. The central result is that this single click yields a 31.4 percent relative gain on strict metrics such as AP75 while also transferring to categories and datasets never seen during training.

Core claim

Sparse point prompts supplied at inference time act as an efficient information bridge that augments the weak semantic representations of small objects; the DEAL framework trained on TinySet-9M learns prompt-conditioned features that deliver a 31.4 percent relative improvement in AP75 over fully supervised baselines on that dataset and generalize to unseen categories and unseen datasets.
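
For readers parsing the headline number: 31.4 percent is a relative gain in AP75, not an absolute difference of AP75 points. A minimal arithmetic sketch, with hypothetical absolute values (only the 31.4 percent ratio comes from the paper):

```python
# Illustrative arithmetic only: the absolute AP75 values below are hypothetical;
# the paper reports only the 31.4% *relative* improvement.
def relative_gain(ap_new: float, ap_base: float) -> float:
    """Relative improvement of ap_new over ap_base, as a fraction."""
    return (ap_new - ap_base) / ap_base

ap75_baseline = 0.20                 # hypothetical fully supervised AP75
ap75_deal = ap75_baseline * 1.314    # what a 31.4% relative gain would imply
print(f"{relative_gain(ap75_deal, ap75_baseline):.1%}")  # 31.4%
```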

What carries the argument

The Point-Prompt Small Object Detection (P2SOD) paradigm, in which a single user-supplied point at inference time supplies category-level semantic information to condition the detector.
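
To make the mechanism concrete, here is a minimal sketch, not the authors' DEAL implementation (which Figure 5 says is built on RT-DETR), of one common way a single normalized click could be encoded and injected into detector features; the module names and the additive fusion are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class PointPromptEncoder(nn.Module):
    """Maps a normalized (x, y) click to a prompt embedding."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, point_xy: torch.Tensor) -> torch.Tensor:
        # point_xy: (B, 2), coordinates normalized to [0, 1]
        return self.mlp(point_xy)  # (B, dim)

class PromptConditionedNeck(nn.Module):
    """Broadcasts the prompt embedding over a backbone feature map."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.prompt_encoder = PointPromptEncoder(dim)
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, feats: torch.Tensor, point_xy: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) backbone features; point_xy: (B, 2) user click
        prompt = self.prompt_encoder(point_xy)[:, :, None, None]  # (B, C, 1, 1)
        return self.fuse(feats + prompt)  # prompt added at every spatial location

# Toy usage with random tensors standing in for a real backbone and click.
feats = torch.randn(1, 256, 64, 64)
click = torch.tensor([[0.42, 0.73]])  # normalized (x, y) of the single prompt
conditioned = PromptConditionedNeck(256)(feats, click)
print(conditioned.shape)  # torch.Size([1, 256, 64, 64])
```

Whatever the real fusion looks like, the property the paradigm leans on is that the prompt enters at inference time, so no retraining is needed per click.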

If this is right

  • A single inference-time click produces a 31.4 percent relative AP75 gain over fully supervised training on TinySet-9M.
  • The same model transfers to categories absent from its training set.
  • The same model transfers to entirely new datasets without retraining.
  • Prompt-conditioned representations learned from large-scale data overcome the performance drop that label-efficient methods suffer on small objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Interactive single-point guidance may reduce the annotation burden for large-scale small-object tasks in domains such as aerial imagery or medical scans.
  • Existing label-efficient detectors may require explicit inference-time conditioning rather than training-time feature enhancement when objects occupy very few pixels.
  • The benchmark on TinySet-9M provides a concrete test bed for measuring how much semantic information a point must carry to compensate for missing visual detail.

Load-bearing premise

A single point prompt placed by a user will reliably convey enough category identity to overcome the weak visual cues of small objects without introducing placement-dependent bias.

What would settle it

Run DEAL on TinySet-9M with the same point locations replaced by random coordinates inside each image; if the AP75 gain disappears or reverses, the claim that the prompt supplies useful semantic guidance is falsified.
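
A rough harness for that control experiment, written against hypothetical placeholders run_deal(image, point) and evaluate_ap75(predictions, annotations) rather than the project's actual API; the per-sample fields are likewise an assumed dataset interface:

```python
import random

# Hypothetical placeholders: `run_deal(image, point)` returns detections and
# `evaluate_ap75(preds, annotations)` computes AP75; neither is the project's
# actual API, and the sample fields (image, boxes, gt_point, width, height)
# are an assumed dataset interface.
def random_point(width: int, height: int) -> tuple[float, float]:
    """A point sampled uniformly over the image, ignoring object locations."""
    return random.uniform(0, width - 1), random.uniform(0, height - 1)

def control_run(dataset, run_deal, evaluate_ap75):
    preds_oracle, preds_random, annotations = [], [], []
    for sample in dataset:
        preds_oracle.append(run_deal(sample.image, sample.gt_point))
        preds_random.append(run_deal(sample.image,
                                     random_point(sample.width, sample.height)))
        annotations.append(sample.boxes)
    ap_oracle = evaluate_ap75(preds_oracle, annotations)
    ap_random = evaluate_ap75(preds_random, annotations)
    # If ap_random stays close to ap_oracle, the click adds little semantic
    # signal; if it collapses, the prompt-as-information-bridge claim survives.
    return ap_oracle, ap_random
```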

Figures

Figures reproduced from arXiv: 2604.02773 by Chang Xu, Fang Xu, Guangyou Yang, Gui-Song Xia, Haijian Zhang, Haoran Zhu, Ruixiang Zhang, Wen Yang.

Figure 1: Overview of our study on generalized small object detection. Leveraging the proposed TinySet-9M dataset and …
Figure 2: Statistical analysis of the TinySet-9M dataset. (a) shows the composition of TinySet-9M, where different source datasets …
Figure 3: Visualization of the TinySet-9M dataset. The dataset contains a large number of densely distributed objects with …
Figure 4: Detection results of different detection paradigms on TinySet-9M dataset. Orange boxes denote the detection results …
Figure 5: Overview of the proposed method. (a) Illustration of the overall architecture of DEAL, which is built upon the RT …
Figure 6: Inference settings for the P2SOD task. Setting (1) uses one point prompt per category in each image; Setting (2) assigns one point prompt to each object instance; Setting (3) randomly selects a single category and uses one point from that category as the prompt; Setting (4) randomly selects a category and uses point prompts from all its instances.
Figure 7: Different point prompt positions during inference under …
Figure 8: Detection results of zero-shot methods SAM3 and our proposed DEAL on DOTA-v2.0 dataset. Green boxes, red boxes, …
Figure 9: Detection results on images collected from the Internet. All images are resized to a resolution of 1024 …
Figure 10: Unseen sceneries feature association visualization. In this figure, green bounding boxes denote ground-truth annotations, …
read the original abstract

Small object detection (SOD) remains challenging due to extremely limited pixels and ambiguous object boundaries. These characteristics lead to challenging annotation, limited availability of large-scale high-quality datasets, and inherently weak semantic representations for small objects. In this work, we first address the data limitation by introducing TinySet-9M, the first large-scale, multi-domain dataset for small object detection. Beyond filling the gap in large-scale datasets, we establish a benchmark to evaluate the effectiveness of existing label-efficient detection methods for small objects. Our evaluation reveals that weak visual cues further exacerbate the performance degradation of label-efficient methods in small object detection, highlighting a critical challenge in label-efficient SOD. Secondly, to tackle the limitation of insufficient semantic representation, we move beyond training-time feature enhancement and propose a new paradigm termed Point-Prompt Small Object Detection (P2SOD). This paradigm introduces sparse point prompts at inference time as an efficient information bridge for category-level localization, enabling semantic augmentation. Building upon the P2SOD paradigm and the large-scale TinySet-9M dataset, we further develop DEAL (DEtect Any smalL object), a scalable and transferable point-prompted detection framework that learns robust, prompt-conditioned representations from large-scale data. With only a single click at inference time, DEAL achieves a 31.4% relative improvement over fully supervised baselines under strict localization metrics (e.g., AP75) on TinySet-9M, while generalizing effectively to unseen categories and unseen datasets. Our project is available at https://zhuhaoraneis.github.io/TinySet-9M/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TinySet-9M, the first large-scale multi-domain dataset for small object detection, along with a benchmark evaluating label-efficient methods on small objects. It identifies weak visual cues as a core limitation and proposes the Point-Prompt Small Object Detection (P2SOD) paradigm, which supplies sparse point prompts at inference time to enable category-level semantic augmentation. Building on this, the authors present DEAL, a scalable point-prompted framework trained on the new dataset that achieves a 31.4% relative AP75 improvement over fully supervised baselines on TinySet-9M while generalizing to unseen categories and datasets.

Significance. If the quantitative claims hold under realistic conditions, the work is significant for computer vision research on small object detection. The 9M-scale dataset and benchmark directly address data scarcity and provide a standardized testbed for label-efficient approaches. The P2SOD paradigm offers a practical inference-time mechanism that could lower annotation costs in domains with tiny objects. The reported cross-dataset generalization and project release (code and data) are positive contributions that support reproducibility and further research.

major comments (2)
  1. [§4] §4 (Experiments on TinySet-9M) and abstract: The 31.4% relative AP75 gain is obtained with point prompts supplied at inference. No ablation or sensitivity analysis is reported for placement offsets of even 2-3 pixels, which for objects spanning only a few pixels would place the prompt outside the bounding box and break the claimed semantic bridge, reducing performance to the weak visual cues identified in the introduction as the central difficulty.
  2. [§3.2] §3.2 (P2SOD Paradigm): The construction treats the single point as reliably supplying category-level semantic information sufficient to overcome ambiguous boundaries. The manuscript provides no evidence or controlled test that the learned prompt-conditioned representations remain effective when the point misses the object, which is a load-bearing assumption for the single-click usability claim.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'effective generalization to unseen categories and unseen datasets' is stated without naming the target datasets, metrics, or quantitative deltas; these details should be added for clarity.
  2. [§4] Figure captions and §4: Several result tables and plots lack error bars, standard deviations across runs, or statistical significance tests, making it difficult to assess the stability of the reported relative gains.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on our work introducing TinySet-9M and the DEAL framework for point-prompted small object detection. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments on TinySet-9M) and abstract: The 31.4% relative AP75 gain is obtained with point prompts supplied at inference. No ablation or sensitivity analysis is reported for placement offsets of even 2-3 pixels, which for objects spanning only a few pixels would place the prompt outside the bounding box and break the claimed semantic bridge, reducing performance to the weak visual cues identified in the introduction as the central difficulty.

    Authors: We agree that evaluating the sensitivity to point placement offsets is crucial for validating the practical applicability of the P2SOD paradigm, especially given the small size of the objects. The reported results assume points are placed accurately within the object boundaries, consistent with the one-click inference setting. To strengthen the manuscript, we will conduct and report an ablation study simulating random offsets of 1 to 5 pixels around the object center, measuring the impact on AP75. This analysis will be added to Section 4 (an illustrative offset-sweep sketch follows these responses). revision: yes

  2. Referee: [§3.2] §3.2 (P2SOD Paradigm): The construction treats the single point as reliably supplying category-level semantic information sufficient to overcome ambiguous boundaries. The manuscript provides no evidence or controlled test that the learned prompt-conditioned representations remain effective when the point misses the object, which is a load-bearing assumption for the single-click usability claim.

    Authors: The P2SOD paradigm is designed with the assumption that the user-provided point lies on the target object to supply the necessary semantic cue. We acknowledge the lack of explicit testing for off-object points in the current version. In the revision, we will add experiments where points are deliberately placed outside the bounding boxes (e.g., in background regions) to assess the model's behavior and performance drop. This will help clarify the robustness and limitations of the single-click approach, and we will discuss mitigation strategies if needed. revision: yes
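
For context on response 1, an illustrative sketch of what the promised 1-to-5-pixel offset sweep could look like; it reuses the same hypothetical run_deal and evaluate_ap75 placeholders as the control harness earlier on this page and is not the authors' planned protocol:

```python
import math
import random

# Reuses the hypothetical `run_deal` / `evaluate_ap75` placeholders and the
# assumed per-sample fields from the control sketch earlier on this page.
def jitter_point(x: float, y: float, radius_px: float,
                 width: int, height: int) -> tuple[float, float]:
    """Offset (x, y) by exactly `radius_px` pixels in a random direction, clamped."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    jx = min(max(x + radius_px * math.cos(theta), 0.0), width - 1)
    jy = min(max(y + radius_px * math.sin(theta), 0.0), height - 1)
    return jx, jy

def offset_sweep(dataset, run_deal, evaluate_ap75, radii=(1, 2, 3, 4, 5)):
    """AP75 as a function of prompt placement error, one pass per offset radius."""
    results = {}
    for r in radii:
        preds, annotations = [], []
        for sample in dataset:
            x, y = sample.gt_point
            preds.append(run_deal(sample.image,
                                  jitter_point(x, y, r, sample.width, sample.height)))
            annotations.append(sample.boxes)
        results[r] = evaluate_ap75(preds, annotations)
    return results  # e.g. {1: ap75_at_1px, 2: ap75_at_2px, ...}
```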

Circularity Check

0 steps flagged

No circularity: empirical claims rest on new benchmark and external generalization tests

full rationale

The paper introduces TinySet-9M and the P2SOD/DEAL framework as an empirical solution to small-object detection. Reported gains (e.g., 31.4% relative AP75 improvement) are measured on the newly collected dataset after training, with explicit additional evaluation on unseen categories and unseen external datasets. No equations, self-definitional constructions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text that would reduce the central performance claims to the inputs by construction. The derivation chain is therefore self-contained experimental validation rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that point prompts at inference supply usable semantic augmentation and that the new dataset is representative of real-world small-object distributions; no new physical entities or ad-hoc constants are introduced beyond standard deep-learning training.

axioms (1)
  • domain assumption Sparse point prompts at inference time provide reliable category-level localization cues for small objects
    Invoked in the definition of the P2SOD paradigm to justify semantic augmentation.

pith-pipeline@v0.9.0 · 5605 in / 1265 out tokens · 47764 ms · 2026-05-13T20:07:17.002582+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    UHR-DETR delivers 2.8% higher mAP and 10x faster inference than sliding-window baselines for small object detection in UHR remote sensing imagery on a single 24GB GPU.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 4 internal anchors
