pith. machine review for the scientific record.

arxiv: 2604.06332 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

Dmitriy Rivkin, Felix Heide, Mario Bijelic, Parker Ewen

Pith reviewed 2026-05-10 18:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords ultra-long-range object detection · hyperbolic foveation · autonomous driving · small object detection · image re-sampling · highway safety · two-stage detection

The pith

Telescope uses a learnable hyperbolic foveation layer to raise mAP for objects beyond 250 meters from 0.185 to 0.326.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Telescope, a two-stage object detector for autonomous highway driving that adds a novel re-sampling layer and image transformation based on learnable hyperbolic foveation. This targets the problem that distant vehicles and obstacles occupy only a few pixels in camera images, causing standard detectors to fail at the ranges required for safe braking at high speeds. Image-based detection is presented as the practical way to reach beyond 500 meters because current LiDAR sensors lose resolution quadratically with distance and fall short of that range. The paper reports a 76 percent relative mAP gain at ultra-long ranges while adding little computation and preserving accuracy at shorter distances.

Core claim

Telescope combines a standard detection backbone with a re-sampling layer that applies a trainable hyperbolic foveation transformation to the input image. The transformation enlarges the effective resolution of regions containing small, distant objects. On driving scenes this produces a 76 percent relative improvement in mean average precision for detection beyond 250 meters, moving absolute mAP from 0.185 to 0.326, with minimal added cost and no degradation at closer ranges.
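The headline figure is plain arithmetic on those two absolute values: (0.326 − 0.185) / 0.185 ≈ 0.762, which rounds to the quoted 76 percent relative improvement.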

What carries the argument

The learnable hyperbolic foveation re-sampling layer, a module that uses a trainable hyperbolic mapping to re-sample the image and allocate higher pixel density to distant scene regions.
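To make the mechanism concrete, here is a minimal sketch of a learnable foveated re-sampling layer in PyTorch. It is not the paper's parameterization: Telescope derives its warp from a Poincaré-disk construction and predicts the transform parameters per image in stage one, whereas this sketch uses a single global trainable center and strength with an illustrative sinh radial map; the class and parameter names are hypothetical.

```python
# Minimal sketch of a learnable foveated re-sampling layer (NOT the paper's
# Poincare-disk parameterization; the sinh radial map, global trainable
# center, and strength parameter are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoveatedResample(nn.Module):
    def __init__(self, out_h: int, out_w: int):
        super().__init__()
        self.out_h, self.out_w = out_h, out_w
        # Fovea center in normalized [-1, 1] image coordinates, and a
        # positive strength controlling how strongly the center is magnified.
        self.center = nn.Parameter(torch.zeros(2))
        self.raw_strength = nn.Parameter(torch.tensor(1.0))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, C, H, W) -> re-sampled (B, C, out_h, out_w)
        b = img.shape[0]
        ys = torch.linspace(-1.0, 1.0, self.out_h, device=img.device)
        xs = torch.linspace(-1.0, 1.0, self.out_w, device=img.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)       # (out_h, out_w, 2), x first

        offset = grid - self.center                # coordinates relative to the fovea
        r = offset.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        s = F.softplus(self.raw_strength)
        # Convex radial map: near the fovea the source radius grows slower
        # than the output radius (magnification); at the periphery it grows
        # faster (compression). sinh(s*r)/sinh(s) fixes r=0 -> 0 and r=1 -> 1.
        r_src = torch.sinh(s * r) / torch.sinh(s)
        src = self.center + offset / r * r_src
        src = src.clamp(-1.0, 1.0).expand(b, -1, -1, -1)
        # Bilinear sampling keeps the whole warp differentiable, so the
        # detection loss can move the fovea and tune its strength.
        return F.grid_sample(img, src, align_corners=True)

# Usage: warp, detect on the warped image, then map predicted boxes back
# through the inverse radial map (not shown).
layer = FoveatedResample(out_h=512, out_w=512)
warped = layer(torch.rand(2, 3, 1080, 1920))
```

The design choice that matters is shared with the paper: the warp is differentiable in its parameters, so the detection loss itself decides where image resolution is spent. Telescope's stage one goes further and estimates the transformation parameters per image from a down-sampled input.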

If this is right

  • Autonomous vehicles can detect critical objects at braking distances required for high-speed highway operation.
  • Image-only detection becomes viable for ultra-long ranges without requiring upgraded LiDAR hardware.
  • The same model maintains competitive performance at short and medium ranges.
  • The approach adds only modest computational overhead to existing detection pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same re-sampling idea could be tested on other vision tasks where scale varies sharply across an image, such as aerial surveillance.
  • End-to-end training of the foveation parameters may allow similar gains in domains where biological foveation has not yet been adapted.
  • Combining the layer with temporal fusion across video frames could further extend reliable detection range.

Load-bearing premise

The hyperbolic re-sampling layer must increase effective resolution on distant objects without creating artifacts that harm detection or reduce accuracy on nearby objects.

What would settle it

Run the trained Telescope model on a held-out set of real highway images containing objects at distances over 250 meters; the claim fails if the mAP gain disappears or the transformed image regions show visible distortions.

Figures

Figures reproduced from arXiv: 2604.06332 by Dmitriy Rivkin, Felix Heide, Mario Bijelic, Parker Ewen.

Figure 1
Figure 1: Long-range Objects in Driving Datasets. Analysis of the TruckDrive [16] dataset shows the distribution of object distances and the breakdown of the pixel-wise composition of objects at each distance. While all object ranges are equally represented in images, the proportion of pixel area disproportionately favors nearby objects, with long (150–250 m) and ultra-long (≥ 250 m) objects occupying only a small …
Figure 2
Figure 2: Telescope. We propose a two-stage ultra-long range detection model. Stage one uses a down-sampled image to estimate the hyperbolic foveation image transformation parameters. This transformation enlarges distant objects at the center of the transform while shrinking nearby objects at the periphery. Stage two uses this transformed image alongside learned hyperbolic embeddings to detect objects at distances o…
Figure 3
Figure 3: Hyperbolic Foveated Transform. The transformation coefficients enable the re-parameterization of the bounding box in the induced Riemannian space, where the box center and tangent vector magnitudes fully describe the box location and shape, where w(r) = (1 − min(r/R, 1))^p is the radial interpolation coefficient, p > 0 is the fixed blending exponent, and R > 0 is the radial scale of the Poincaré disk (this coefficient is written out after the figure list). For r …
Figure 5
Figure 5: Learned hyperbolic foveated transform on the TruckDrive dataset. The original image (left) and the foveated image (right) are shown, together with the percentage increase in object bounding-box area. Both views are cropped to the same image region, highlighting the local magnification induced by the foveation transform. The proposed transform is effective for both isolated targets and dense, busy scenes.
Figure 6
Figure 6: Qualitative Visualization. Detections from Telescope on the TruckDrive [16] dataset. Telescope consistently detects and localizes distant vehicles that occupy only a few pixels, while preserving accurate predictions for nearby objects. These examples highlight the effect of the proposed hyperbolic foveated transform in magnifying ultra-long range regions and improving sensitivity to objects in these rang…
Figure 7
Figure 7: Qualitative Comparison. Qualitative comparison between the proposed method, Telescope, and state-of-the-art baselines. Both RVSA [44] and RFLA [52] are specialized for small object detection, while DETR [5] and DINO [56] are strong general object detectors but perform worse in long and ultra-long range object detection. Ground truth annotations are shown on the left. Zoomed-in views corresponding to the red…
Figure 8
Figure 8: Long-range Objects in Argoverse Driving. Analysis of the Argoverse [47] dataset shows the distribution of object distances and the breakdown of the pixel-wise composition of objects at each distance. Multiple object ranges are represented in images, but nearby objects are disproportionately favored in terms of pixel area, with far (50–150 m) and long (150–250 m) range objects occupying only a small fractio…
Figure 9
Figure 9: Additional Qualitative Comparison on Argoverse Dataset. Qualitative comparison between the proposed method, Telescope, and state-of-the-art baselines specialized for small object detection. Ground truth annotations are shown on the left. Notably, there are many target boxes which represent occluded objects (rows 1, 2, 3, 6, and 7). All methods are fine-tuned on the Argoverse [47] dataset.
Figure 10
Figure 10: Additional Qualitative Comparison on TruckDrive Dataset. Qualitative comparison between the proposed method, Telescope, and state-of-the-art baselines. Both RVSA [44] and RFLA [52] are specialized for small object detection, while DETR [5] and DINO [56] are strong general object detectors but perform worse in long and ultra-long range object detection. Ground truth annotations are shown on the left. Zoomed-in views corresponding to the red rectangles are provided to highlight detections at long and ultra-long range, where some objects reach up to 1 km. All baselines are fine-tuned on the TruckDrive [16] dataset.
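Written out, the radial interpolation coefficient quoted in the Figure 3 caption is w(r) = (1 − min(r/R, 1))^p, with p > 0 the fixed blending exponent and R > 0 the radial scale of the Poincaré disk: w equals 1 at the disk center (r = 0), decays to 0 at r = R, and is 0 beyond it, with p setting how quickly that decay happens. Reading w as the weight that blends the foveated mapping into an identity mapping toward the periphery is an inference from the truncated caption, not something the excerpt states explicitly.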
Original abstract

Autonomous highway driving, especially for long-haul heavy trucks, requires detecting objects at long ranges beyond 500 meters to satisfy braking distance requirements at high speeds. At long distances, vehicles and other critical objects occupy only a few pixels in high-resolution images, causing state-of-the-art object detectors to fail. This challenge is compounded by the limited effective range of commercially available LiDAR sensors, which fall short of ultra-long range thresholds because of quadratic loss of resolution with distance, making image-based detection the most practically scalable solution given commercially available sensor constraints. We introduce Telescope, a two-stage detection model designed for ultra-long range autonomous driving. Alongside a powerful detection backbone, this model contains a novel re-sampling layer and image transformation to address the fundamental challenges of detecting small, distant objects. Telescope achieves $76\%$ relative improvement in mAP in ultra-long range detection compared to state-of-the-art methods (improving from an absolute mAP of 0.185 to 0.326 at distances beyond 250 meters), requires minimal computational overhead, and maintains strong performance across all detection ranges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Telescope, a two-stage object detector for ultra-long-range autonomous driving that incorporates a learnable hyperbolic foveation re-sampling layer and image transformation to increase effective resolution for distant objects occupying few pixels. It reports a 76% relative mAP improvement (from 0.185 to 0.326) for objects beyond 250 m on a highway driving dataset, with claims of no degradation at shorter ranges, minimal computational overhead, and motivation from foveated vision principles implemented in a differentiable manner.

Significance. If the reported gains hold under rigorous verification, the work could meaningfully advance image-based long-range perception for high-speed highway scenarios where LiDAR range is insufficient. The differentiable hyperbolic re-sampling is a concrete technical contribution that aligns with biological foveation and could be adopted in other detectors; the absence of circularity in the empirical gains (as they are not reduced to fitted quantities by construction) strengthens the case for further investigation.

major comments (2)
  1. [Abstract] Abstract and results: the headline mAP values (0.185 baseline to 0.326) are presented without error bars, standard deviations from multiple runs, or statistical significance tests; this directly affects confidence in the 76% relative improvement claim for the >250 m regime. (A bootstrap sketch follows the minor comments below.)
  2. [Results] Evaluation: the distance-thresholded mAP (>250 m) requires explicit details on dataset size, object count in the long-range subset, distance measurement method, and whether the split is fixed or cross-validated; without these, the improvement cannot be fully assessed as robust rather than dataset-specific.
minor comments (2)
  1. The paper should include an ablation study isolating the contribution of the learnable hyperbolic parameters versus the backbone or other components to confirm the source of the gain.
  2. [Methods] Clarify the exact parameterization of the hyperbolic foveation layer (e.g., the form of the learnable parameters and their initialization) in the methods section for reproducibility.
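
On the first major comment, a variance estimate is cheap to obtain even post hoc. The sketch below is a hedged illustration of a paired bootstrap over test images; evaluate_map is a hypothetical stand-in for whatever evaluation code produced the reported 0.185 and 0.326, restricted to objects beyond 250 m.

```python
# Paired bootstrap CI for the long-range mAP gap between two detectors.
# `evaluate_map(image_ids, model_name)` is hypothetical: it must recompute
# mAP over >250 m objects on exactly the given (possibly repeated) images.
import numpy as np

def bootstrap_map_gap_ci(image_ids, evaluate_map, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    ids = np.asarray(image_ids)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample test images with replacement; evaluate both models on
        # the SAME resample so the comparison is paired.
        sample = rng.choice(ids, size=len(ids), replace=True)
        gaps[i] = evaluate_map(sample, "telescope") - evaluate_map(sample, "baseline")
    lo, hi = np.quantile(gaps, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lo, hi  # if lo > 0, the gain survives image-resampling noise
```

Because ultra-long-range objects are a small fraction of the annotations, the resample-to-resample spread of this gap is exactly what single-run headline numbers hide.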

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the recognition of the technical contribution and the potential impact on long-range perception. We address each major comment below, indicating where revisions will be incorporated to improve clarity and robustness.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results: the headline mAP values (0.185 baseline to 0.326) are presented without error bars, standard deviations from multiple runs, or statistical significance tests; this directly affects confidence in the 76% relative improvement claim for the >250 m regime.

    Authors: We acknowledge that reporting variability measures would strengthen the presentation of the headline results. The reported mAP values are from a single training run, as is common in large-scale detection experiments due to computational constraints on our highway dataset. In the revised manuscript we will add an explicit statement in both the abstract and the results section noting the single-run nature of the evaluation and will include additional supporting evidence from ablation studies showing consistent relative gains across multiple distance thresholds and backbone variants. We will also add a brief discussion of why formal significance testing was not performed. revision: partial

  2. Referee: [Results] Evaluation: the distance-thresholded mAP (>250 m) requires explicit details on dataset size, object count in the long-range subset, distance measurement method, and whether the split is fixed or cross-validated; without these, the improvement cannot be fully assessed as robust rather than dataset-specific.

    Authors: We agree that these details are essential for assessing robustness. The main paper summarized the dataset at a high level while deferring some specifics to the supplementary material. In the revised version we will expand the 'Dataset and Evaluation Protocol' subsection to explicitly state: the total number of images and annotations, the number of objects beyond 250 m in the test set, the distance measurement procedure (camera-LiDAR fusion with GPS ground truth), and confirmation that the train/test split follows the dataset's fixed protocol without cross-validation. These additions will be placed in the main text rather than only in the supplement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results self-contained

full rationale

The manuscript introduces a two-stage detector with a differentiable hyperbolic re-sampling layer motivated by foveated vision, then reports mAP gains (0.185 to 0.326) from training and distance-thresholded evaluation on highway data. No equations, uniqueness theorems, or predictions are shown that reduce the reported improvement to a fitted quantity or self-citation by construction. The central claim rests on experimental tables and architecture details that remain independently verifiable against external datasets and baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on standard supervised deep-learning training assumptions plus the design of the novel re-sampling layer; beyond the learnable foveation parameters listed below, no explicit free parameters or invented entities are named in the abstract.

free parameters (1)
  • learnable hyperbolic foveation parameters
    Parameters of the re-sampling layer that are optimized during training to produce the reported long-range gains.

pith-pipeline@v0.9.0 · 5500 in / 1045 out tokens · 36276 ms · 2026-05-10T18:42:32.909640+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 14 canonical work pages · 9 internal anchors

  1. Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: Finding tiny faces in the wild with generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 21–30 (2018)

  2. Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., et al.: Perception Encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181 (2025)

  3. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11621–11631 (2020)

  4. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision. pp. 213–229. Springer (2020)

  6. Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al.: Argoverse: 3D tracking and forecasting with rich maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8748–8757 (2019)

  7. Chen, C., Liu, M.Y., Tuzel, O., Xiao, J.: R-CNN for small object detection. In: Asian Conference on Computer Vision. pp. 214–230. Springer (2016)

  8. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

  9. Cheng, G., Yuan, X., Yao, X., Yan, K., Zeng, Q., Xie, X., Han, J.: Towards large-scale small object detection: Survey and benchmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 13467–13488 (2023)

  10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)

  11. Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., Zhang, L.: Dynamic DETR: End-to-end object detection with dynamic attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2988–2997 (2021)

  12. Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: Unsupervised pre-training for object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1601–1610 (2021)

  13. De Plaen, H., De Plaen, P.F., Suykens, J.A., Proesmans, M., Tuytelaars, T., Van Gool, L.: Unbalanced optimal transport: A unified framework for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3198–3207 (2023)

  14. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)

  15. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3354–3361. IEEE (2012)

  16. Ghilotti, F., Palladin, E., Brucker, S., Sigal, A., Bijelic, M., Heide, F.: TruckDrive: Long-range autonomous highway driving dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2026)

  17. Gong, Y., Yu, X., Ding, Y., Peng, X., Zhao, J., Han, Z.: Effective fusion factor in FPN for tiny object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1160–1168 (2021)

  18. Guo, G., Chen, P., Yu, X., Han, Z., Ye, Q., Gao, S.: Save the tiny, save the all: Hierarchical activation network for tiny object detection. IEEE Transactions on Circuits and Systems for Video Technology 34(1), 221–234 (2023)

  19. Hidayatullah, P., Syakrani, N., Sholahuddin, M.R., Gelar, T., Tubagus, R.: YOLOv8 to YOLO11: A comprehensive architecture in-depth comparative review. arXiv preprint arXiv:2501.13400 (2025)

  20. Huang, X., Cheng, X., Geng, Q., Cao, B., Zhou, D., Wang, P., Lin, Y., Yang, R.: The ApolloScape dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 954–960 (2018)

  21. Jabbireddy, S., Sun, X., Meng, X., Varshney, A.: Foveated rendering: Motivation, taxonomy, and research directions. arXiv preprint arXiv:2205.04529 (2022)

  22. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. Advances in Neural Information Processing Systems 28 (2015)

  23. Lee, G., Hong, S., Cho, D.: Self-supervised feature enhancement networks for small object detection in noisy images. IEEE Signal Processing Letters 28, 1026–1030 (2021)

  24. Lee, J.M.: Introduction to Riemannian Manifolds, vol. 2. Springer (2018)

  25. Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: DN-DETR: Accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13619–13627 (2022)

  26. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)

  27. Liu, J., Zhang, J., Ni, Y., Chi, W., Qi, Z.: Small-object detection in remote sensing images with super-resolution perception. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 17, 15721–15734 (2024)

  28. Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: European Conference on Computer Vision. pp. 38–55. Springer (2024)

  29. Mao, J., Shi, S., Wang, X., Li, H.: 3D object detection for autonomous driving: A comprehensive survey. International Journal of Computer Vision 131(8), 1909–1963 (2023)

  30. Mirzaei, B., Nezamabadi-Pour, H., Raoof, A., Derakhshani, R.: Small object detection and tracking: A comprehensive review. Sensors 23(15), 6887 (2023)

  31. Nguyen, N.D., Do, T., Ngo, T.D., Le, D.D.: An evaluation of deep learning methods for small object detection. Journal of Electrical and Computer Engineering 2020(1), 3189691 (2020)

  32. Noh, J., Bae, W., Lee, W., Seo, J., Kim, G.: Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9725–9734 (2019)

  33. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  34. Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

  35. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016)

  36. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015)

  37. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 658–666 (2019)

  38. Shinya, Y.: USB: Universal-scale object detection benchmark. arXiv preprint arXiv:2103.14027 (2021)

  39. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)

  40. Stanoyevitch, A., Stegenga, D.A.: The geometry of Poincaré disks. Complex Variables and Elliptic Equations 24(3-4), 249–265 (1994)

  41. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2446–2454 (2020)

  42. Thavamani, C., Li, M., Cebron, N., Ramanan, D.: FOVEA: Foveated image magnification for autonomous navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15539–15548 (2021)

  43. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9627–9636 (2019)

  44. Wang, D., Zhang, Q., Xu, Y., Zhang, J., Du, B., Tao, D., Zhang, L.: Advancing plain vision transformer toward remote sensing foundation model. IEEE Transactions on Geoscience and Remote Sensing 61, 1–15 (2022)

  45. Wang, J., Xu, C., Yang, W., Yu, L.: A normalized Gaussian Wasserstein distance for tiny object detection. arXiv preprint arXiv:2110.13389 (2021)

  46. Wei, W., Cheng, Y., He, J., Zhu, X.: A review of small object detection based on deep learning. Neural Computing and Applications 36(12), 6283–6303 (2024)

  47. Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J.K., et al.: Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493 (2023)

  48. Wong, K., Gu, Y., Kamijo, S.: Mapping for autonomous driving: Opportunities and challenges. IEEE Intelligent Transportation Systems Magazine 13(1), 91–106 (2020)

  49. Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: DOTA: A large-scale dataset for object detection in aerial images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3974–3983 (2018)

  50. Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4794–4803 (2022)

  51. Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., Xia, G.S.: Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 190, 79–93 (2022)

  52. Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., Xia, G.S.: RFLA: Gaussian receptive field based label assignment for tiny object detection. In: European Conference on Computer Vision. pp. 526–543. Springer (2022)

  53. Yang, C., Huang, Z., Wang, N.: QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13668–13677 (2022)

  54. Yu, X., Gong, Y., Jiang, N., Ye, Q., Han, Z.: Scale match for tiny person detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1257–1265 (2020)

  55. Yurtsever, E., Lambert, J., Carballo, A., Takeda, K.: A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 8, 58443–58469 (2020)

  56. Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)

  57. Zhao, Y., Zhu, F., Mi, Y., Chen, D., Xiong, G.: Simple-FPN: An image anomaly detection and localization network based on SimpleNet and feature pyramid. In: 2024 IEEE 4th International Conference on Digital Twins and Parallel Intelligence (DTPI). pp. 417–422. IEEE (2024)

  58. Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X.: Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30(11), 3212–3232 (2019)

  59. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)

  60. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)

  61. Zou, Z., Chen, K., Shi, Z., Guo, Y., Ye, J.: Object detection in 20 years: A survey. Proceedings of the IEEE 111(3), 257–276 (2023)

  62. Internal anchor (appendix). Appendix Section A reports details on image-based object distance estimation as well as distance-based statistics and information regarding the Argoverse 2 [47] autonomous driving dataset; Section B provides an additi… For the TruckDrive dataset the focal length is f = 3304. Under the pinhole camera model, the object distance d can be approximated as d ≈ f · H_c / h_p (Eq. 6). A worked example follows this list.

  63. Internal anchor (Figure 10 caption continuation). …are strong general object detectors, but perform worse in long and ultra-long range object detection. Ground truth annotations are shown on the left. Zoomed-in views corresponding to the red rectangles are provided to highlight detections at long and ultra-long range, where some objects reach up to 1 km. All baselines are fine-tuned on the TruckDrive [16] dataset.
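
The pinhole relation preserved in anchor [62] gives a quick sense of scale. In the standard pinhole model, H_c is the object's physical height and h_p its height in image pixels; taking an assumed vehicle height of 1.5 m (an illustrative value, not one from the paper) and the stated TruckDrive focal length f = 3304 pixels: d ≈ f · H_c / h_p = 3304 × 1.5 m / 20 px ≈ 248 m. A vehicle only 20 pixels tall is thus already at the ≥ 250 m ultra-long-range threshold, which is why the pixel-area statistics in Figures 1 and 8 skew so strongly toward nearby objects.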