pith. machine review for the scientific record.

arxiv: 2502.12524 · v1 · submitted 2025-02-18 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

YOLOv12: Attention-Centric Real-Time Object Detectors

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: real-time object detection · attention mechanisms · YOLO framework · object detectors · inference latency · accuracy comparison · CNN alternatives

The pith

YOLOv12 centers its architecture on attention mechanisms to exceed the accuracy of prior real-time object detectors while keeping inference speeds comparable to CNN-based YOLO models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes YOLOv12 as an attention-centric redesign of the YOLO framework. It demonstrates that attention mechanisms, long known for stronger modeling but previously too slow for real-time use, can be integrated to match the inference latency of CNN-based predecessors. On standard benchmarks, the smallest YOLOv12 variant reaches 40.6 percent mAP at 1.64 milliseconds on a T4 GPU, beating recent YOLOv10 and YOLOv11 versions by 2.1 and 1.2 percent mAP at similar speed. The gains hold across model sizes and extend to comparisons against end-to-end detectors such as RT-DETR, where YOLOv12 uses far fewer parameters and computations while running faster. This directly challenges the assumption that attention-based detectors must trade speed for accuracy in real-time settings.
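
A quick back-of-envelope reading of those headline figures. The 1.64 ms latency and the "42 percent faster" comparison come from the abstract; the throughput conversion and the latency-ratio interpretation below are plain arithmetic, and the batch-size assumption is ours rather than the paper's.

```python
# Back-of-envelope reading of the headline numbers quoted above.
# 40.6 mAP, 1.64 ms, and "42% faster" come from the abstract; the rest is arithmetic.

latency_n_ms = 1.64                      # YOLOv12-N on a T4 GPU (batch size 1 assumed)
throughput_n = 1000.0 / latency_n_ms     # images per second implied by that latency
print(f"YOLOv12-N implied throughput: {throughput_n:.0f} img/s")    # ~610 img/s

# "42% faster than RT-DETR-R18" is read here as RT-DETR-R18 taking 1.42x the
# latency of YOLOv12-S; one plausible reading, not a figure stated in the paper.
rtdetr_relative_latency = 1.42
print(f"RT-DETR-R18 latency relative to YOLOv12-S: {rtdetr_relative_latency:.2f}x")
```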

Core claim

YOLOv12 is an attention-centric YOLO framework that matches the speed of previous CNN-based models while delivering higher accuracy, surpassing popular real-time detectors such as YOLOv10-N, YOLOv11-N, and RT-DETR variants on standard benchmarks.

What carries the argument

The attention-centric architectural changes in YOLOv12 that enable attention mechanisms to run at CNN-comparable speeds while retaining their modeling advantages.
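
As a rough illustration of why this is the load-bearing machinery: naive global self-attention scales quadratically with token count, while the windowed or linear formulations mentioned in the referee report below scale roughly linearly. The token count, channel width, and window size in this sketch are illustrative assumptions, not values taken from the paper.

```python
# Rough FLOPs count for one attention stage on a detection-sized feature map.
# All numbers are illustrative assumptions, not figures from the paper; the count
# covers only the QK^T and AV matmuls and ignores projections and softmax.

tokens = 80 * 80      # e.g. a 640x640 input at stride 8
dim = 128             # assumed channel width of the stage
window = 8 * 8        # assumed tokens per local window

global_flops = 2 * tokens * tokens * dim     # O(n^2 d): every token attends to every token
windowed_flops = 2 * tokens * window * dim   # O(n w d): attention confined to local windows

print(f"global attention:   {global_flops / 1e9:.2f} GFLOPs")       # ~10.5
print(f"windowed attention: {windowed_flops / 1e9:.3f} GFLOPs")     # ~0.10
print(f"reduction factor:   {global_flops / windowed_flops:.0f}x")  # tokens / window = 100x
```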

If this is right

  • YOLOv12-N reaches 40.6 percent mAP at 1.64 ms inference latency on T4 GPU, exceeding YOLOv10-N and YOLOv11-N by 2.1 and 1.2 percent mAP.
  • YOLOv12-S runs 42 percent faster than RT-DETR-R18 while using 36 percent of the computation and 45 percent of the parameters.
  • The accuracy advantage holds across multiple model scales from nano to larger variants.
  • Attention mechanisms become viable as the primary backbone for real-time object detection without custom hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of other real-time vision systems may shift priority from CNN blocks to attention blocks once speed parity is shown feasible.
  • The result suggests that targeted architectural tuning can close the efficiency gap between attention and convolution in latency-sensitive tasks.
  • Future work could test whether the same attention-centric pattern transfers to related problems such as real-time instance segmentation or video object tracking.

Load-bearing premise

The specific attention mechanisms and any accompanying optimizations can be implemented to run at speeds matching CNN-based YOLO models on standard hardware.
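
A minimal sketch of the kind of mechanism this premise points at: self-attention restricted to local windows, with the matmuls routed through PyTorch's fused scaled_dot_product_attention, which dispatches to FlashAttention-style kernels on supported GPUs. The module below is an illustration under those assumptions, not the paper's actual block design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedAttention(nn.Module):
    """Illustrative windowed self-attention over a (B, C, H, W) feature map."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 8):
        super().__init__()
        self.num_heads, self.window = num_heads, window
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws, nh, hd = self.window, self.num_heads, c // self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=1)

        def windows(t: torch.Tensor) -> torch.Tensor:
            # (B, C, H, W) -> (B * num_windows, heads, window*window, head_dim)
            t = t.reshape(b, nh, hd, h // ws, ws, w // ws, ws)
            return t.permute(0, 3, 5, 1, 4, 6, 2).reshape(-1, nh, ws * ws, hd)

        # Fused attention; uses FlashAttention-style kernels when they are available.
        out = F.scaled_dot_product_attention(windows(q), windows(k), windows(v))
        out = out.reshape(b, h // ws, w // ws, nh, ws, ws, hd)
        out = out.permute(0, 3, 6, 1, 4, 2, 5).reshape(b, c, h, w)
        return self.proj(out)

x = torch.randn(1, 64, 32, 32)          # H and W assumed divisible by the window size
print(WindowedAttention(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```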

What would settle it

A side-by-side benchmark on COCO showing YOLOv12 achieving lower mAP than YOLOv11-N at equal or higher latency on a T4 GPU would falsify the central performance claim.
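
A hedged sketch of how the latency half of that test could be run: time each detector at batch size 1 on a CUDA device, the way the T4 comparison above is described. The placeholder model, input size, warm-up counts, and FP16 choice are assumptions rather than the paper's protocol, and the mAP half of the test would additionally need a standard COCO evaluation.

```python
import time
import torch

@torch.inference_mode()
def measure_latency_ms(model: torch.nn.Module, input_size: int = 640,
                       warmup: int = 50, iters: int = 200) -> float:
    """Average per-image latency in milliseconds at batch size 1."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, input_size, input_size, device=device)
    if device == "cuda":
        model, x = model.half(), x.half()   # FP16 assumed; the referee asks for this to be stated
    for _ in range(warmup):                 # warm up kernels and caches before timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / iters

# Placeholder module; substitute real YOLOv12-N and YOLOv11-N models to run the comparison.
print(f"{measure_latency_ms(torch.nn.Conv2d(3, 16, 3, padding=1)):.2f} ms per image")
```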

read the original abstract

Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces YOLOv12, an attention-centric real-time object detector that replaces or augments CNN components with attention mechanisms while claiming to retain CNN-comparable inference speeds. It reports that YOLOv12-N achieves 40.6% mAP at 1.64 ms latency on T4 GPU, outperforming YOLOv10-N and YOLOv11-N by 2.1% and 1.2% mAP respectively, with similar advantages across scales and against RT-DETR variants in speed, compute, and parameters.

Significance. If the efficiency claims hold, the result would be significant for real-time detection by showing that attention can deliver measurable accuracy gains without the usual quadratic latency penalty, potentially shifting design paradigms away from pure CNN backbones. The concrete benchmark numbers and cross-family comparisons provide falsifiable predictions that could be directly tested on standard hardware.

major comments (2)
  1. [Abstract and Section 3] Abstract and architecture description: the central claim that attention-centric modules achieve 1.64 ms latency on T4 for the N-scale model while improving mAP requires an explicit complexity analysis (FLOPs scaling, windowed/linear attention formulation, or FlashAttention integration) showing how quadratic costs are eliminated; without this, it is unclear whether the reported speed derives from the attention design or from unstated CNN fallbacks or resolution reductions.
  2. [Experiments] Experiments section: the 2.1%/1.2% mAP gains over YOLOv10-N/YOLOv11-N and the 42% speed advantage over RT-DETR-R18 are load-bearing for the 'surpasses all popular real-time detectors' claim, yet no details are supplied on whether all models use identical training schedules, augmentation pipelines, or input resolutions; this prevents verification that the gains are attributable to the attention-centric changes rather than training differences.
minor comments (2)
  1. [Figure 1] Figure 1 caption and latency table: confirm that all reported latencies use the same T4 GPU, batch size 1, and FP16/INT8 precision to ensure apples-to-apples comparison.
  2. [Section 3] Notation for model scales (N/S/M/L/X): explicitly define how the attention module widths and depths scale with these variants to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our YOLOv12 manuscript. We have revised the paper to incorporate explicit complexity analysis and experimental protocol details, addressing the concerns while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract and Section 3] Abstract and architecture description: the central claim that attention-centric modules achieve 1.64 ms latency on T4 for the N-scale model while improving mAP requires an explicit complexity analysis (FLOPs scaling, windowed/linear attention formulation, or FlashAttention integration) showing how quadratic costs are eliminated; without this, it is unclear whether the reported speed derives from the attention design or from unstated CNN fallbacks or resolution reductions.

    Authors: We agree that an explicit complexity analysis strengthens the validation of our efficiency claims. In the revised manuscript, Section 3 now includes a dedicated complexity analysis subsection. It details the FLOPs scaling for the attention modules, the windowed and linear attention formulations that achieve linear complexity, and the FlashAttention integration used to eliminate quadratic costs. This confirms that the reported 1.64 ms latency on T4 for YOLOv12-N arises directly from the attention-centric design without CNN fallbacks or resolution reductions. revision: yes

  2. Referee: [Experiments] Experiments section: the 2.1%/1.2% mAP gains over YOLOv10-N/YOLOv11-N and the 42% speed advantage over RT-DETR-R18 are load-bearing for the 'surpasses all popular real-time detectors' claim, yet no details are supplied on whether all models use identical training schedules, augmentation pipelines, or input resolutions; this prevents verification that the gains are attributable to the attention-centric changes rather than training differences.

    Authors: We acknowledge the need for transparent experimental details to ensure fair comparisons. The revised Experiments section now includes an explicit subsection describing the training protocols. All compared models (YOLOv10-N, YOLOv11-N, RT-DETR variants) were trained and evaluated using identical schedules, augmentation pipelines, and input resolutions as defined in their original papers and the standard COCO benchmark settings. This confirms that the mAP and speed gains are attributable to YOLOv12's attention-centric architecture. revision: yes
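
To make the promised protocol concrete, a hypothetical record of the shared settings the referee asked to see pinned down; every value below is a placeholder assumption, not the paper's actual configuration.

```python
# Hypothetical shared-protocol record; all values are placeholders, not from the paper.
shared_protocol = {
    "dataset": "COCO train2017 / val2017",
    "input_resolution": 640,                 # must be identical across all compared detectors
    "training_schedule": "same epochs and optimizer settings for every model",
    "augmentation": ["mosaic", "mixup", "hsv-jitter", "random-flip"],  # placeholder pipeline
    "latency_hardware": "NVIDIA T4",
    "latency_batch_size": 1,
    "latency_precision": "FP16",             # the minor comments ask for this to be stated
}
for key, value in shared_protocol.items():
    print(f"{key}: {value}")
```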

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal with benchmark results

full rationale

The paper introduces YOLOv12 as an attention-centric YOLO variant and supports its claims solely through empirical benchmark comparisons (e.g., mAP and latency numbers on a T4 GPU against YOLOv10/YOLOv11 and RT-DETR variants). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on experimental outcomes measured against external benchmarks rather than on any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is an empirical architecture paper; it relies on standard deep-learning assumptions about attention superiority and benchmark validity rather than new axioms or invented physical entities.

free parameters (1)
  • model scale definitions (N/S/M/L/X)
    Specific channel counts, layer depths, and block configurations per scale are chosen to balance speed and accuracy.
axioms (1)
  • domain assumption: attention mechanisms have superior modeling capabilities compared with CNNs
    Stated directly in the abstract as a premise for the design shift.

pith-pipeline@v0.9.0 · 5526 in / 1281 out tokens · 39861 ms · 2026-05-13T21:30:28.301722+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

    cs.CV 2026-04 unverdicted novelty 7.0

    WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

  2. SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.

  3. AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

    cs.CV 2026-05 unverdicted novelty 6.0

    A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain ...

  4. Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering repor...

  5. Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.

  6. TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

    cs.CV 2026-05 unverdicted novelty 5.0

    TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.

  7. Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation

    cs.RO 2026-05 unverdicted novelty 5.0

    A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.

  8. InsHuman: Towards Natural and Identity-Preserving Human Insertion

    cs.CV 2026-05 unverdicted novelty 5.0

    InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.

  9. LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People

    cs.AI 2026-04 unverdicted novelty 5.0

    A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.

  10. Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.

  11. StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

    cs.CV 2026-04 unverdicted novelty 5.0

    StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.

  12. A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures

    cs.CV 2026-04 unverdicted novelty 5.0

    WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M...

  13. A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization

    cs.CV 2026-05 unverdicted novelty 4.0

    YOLO-MD improves underwater marine debris detection by adding a Dual-Branch Convolutional Enhanced Self-Attention module, a lightweight shift operation, and SFG-Loss for class imbalance, achieving 0.875 precision and ...

  14. Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices

    cs.CV 2026-04 unverdicted novelty 4.0

    YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.

  15. Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model

    cs.CV 2026-04 unverdicted novelty 4.0

    YOLOv12 with Otsu thresholding on cell-based segmentation classifies AML cells at 99.3% validation and test accuracy.

  16. FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    FSDETR enhances RT-DETR with SHAB, DA-AIFI, and FSFPN blocks to improve small-object detection, reporting 13.9% APS on VisDrone 2019 and 48.95% AP50 on TinyPerson using 14.7M parameters.

  17. Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.

  18. DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems

    cs.MM 2026-04 unverdicted novelty 4.0

    DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.

  19. Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation

    cs.CV 2026-04 unverdicted novelty 3.0

    Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.

  20. Real-Time Cellist Postural Evaluation With On-Device Computer Vision

    cs.HC 2026-04 unverdicted novelty 3.0

    Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.

  21. Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

    cs.CV 2026-04 unverdicted novelty 3.0

    A local multi-agent framework integrates YOLO object detection with Slack-Ollama natural language control entirely on Raspberry Pi hardware.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 21 Pith papers · 10 internal anchors

  1. [1]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 6, 9

  2. [2]

    Low-rank bottleneck in multi-head attention models

    Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models. In International conference on machine learning, pages 864–873. PMLR, 2020. 4

  3. [3]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020. 1, 2, 6, 11

  4. [4]

    Anomaly detection in autonomous driving: A survey

    Daniel Bogdoll, Maximilian Nitsche, and J Marius Zöllner. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4488–4499, 2022. 1

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

  6. [6]

    Albumentations: fast and flexible image augmentations

    Alexander Buslaev, Vladimir I Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, 2020. 11

  7. [7]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020. 2

  8. [8]

    Ap-loss for accurate one-stage object detection

    Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, and Junni Zou. Ap-loss for accurate one-stage object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3782–3798, 2020. 1

  9. [9]

    Yolo-ms: rethinking multi-scale representation learning for real-time object detection

    Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, and Ming-Ming Cheng. Yolo-ms: rethinking multi-scale representation learning for real-time object detection. arXiv preprint arXiv:2308.05480, 2023. 2

  10. [11]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020. 3, 4

  11. [12]

    Twins: Revisiting the design of spatial attention in vision transformers

    Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34:9355–9366, 2021. 3

  12. [13]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023. 2, 3, 7, 11

  13. [14]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022. 2, 3, 7, 11

  14. [15]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019. 3

  15. [16]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022. 2, 4

  16. [17]

    Mobile robot navigation using an object recognition software with rgbd images and the yolo algorithm

    Douglas Henke Dos Reis, Daniel Welfer, Marco Antonio De Souza Leite Cuadros, and Daniel Fernando Tello Gamarra. Mobile robot navigation using an object recognition software with rgbd images and the yolo algorithm. Applied Artificial Intelligence, 33(14):1290–1305, 2019. 1

  17. [18]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 6

  18. [19]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023. 3, 6

  19. [21]

    Eva-02: A visual representation for neon genesis

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171,

  20. [22]

    Tood: Task-aligned one-stage object detection

    Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. Tood: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE Computer Society, 2021. 1

  21. [23]

    Ota: Optimal transport assignment for object detection

    Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 303–312, 2021. 1

  22. [24]

    Jocher Glenn. Yolov8. https://github.com/ultralytics/ultralytics/tree/main, 2023. 1, 2, 5, 6, 9, 11

  23. [25]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 1, 6, 9

  24. [26]

    Axial attention in multidimensional transformers

    Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019. 2

  25. [27]

    Ccnet: Criss-cross attention for semantic segmentation

    Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 603–612, 2019. 2, 4

  26. [28]

    Glenn Jocher. yolov11. https://github.com/ultralytics, 2024. 1, 2, 4, 5, 6, 7, 8, 9, 10, 11

  27. [29]

    Glenn Jocher, K Nishimura, T Mineeva, and RJAM Vilariño. yolov5. https://github.com/ultralytics/yolov5/tree, 2, 2020. 1, 2, 6

  28. [30]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020. 4

  29. [31]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020. 3

  30. [32]

    Yolov6 v3.0: A full-scale reloading

    Chuyi Li, Lulu Li, Yifei Geng, Hongliang Jiang, Meng Cheng, Bo Zhang, Zaidan Ke, Xiaoming Xu, and Xiangxiang Chu. Yolov6 v3.0: A full-scale reloading. arXiv preprint arXiv:2301.05586, 2023. 1, 2, 5, 6

  31. [33]

    Dn-detr: Accelerate detr training by introducing query denoising

    Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13619–13627, 2022. 2

  32. [34]

    A dual weighting label assignment scheme for object detection

    Shuai Li, Chenhang He, Ruihuang Li, and Lei Zhang. A dual weighting label assignment scheme for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9387–9396, 2022. 1

  33. [35]

    Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection

    Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems, 33:21002–21012, 2020. 1

  34. [36]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 6, 10

  35. [37]

    Dab-detr: Dynamic anchor boxes are better queries for detr,

    Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329, 2022. 2

  36. [38]

    Vmamba: Visual state space model

    Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2, 3

  37. [39]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2, 3, 4

  38. [41]

    Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer

    Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu. Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140, 2024. 5, 6

  39. [42]

    Conditional detr for fast training convergence

    Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3651–3660, 2021. 2

  40. [43]

    A ranking-based, balanced loss function unifying classification and localisation in object detection

    Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33:15534–15545, 2020.

  41. [44]

    Rank & sort loss for object detection and instance segmentation

    Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Rank & sort loss for object detection and instance segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3009–3018, 2021. 1

  42. [45]

    You only look once: Unified, real-time object detection

    J Redmon. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. 1, 2, 6

  43. [46]

    YOLOv3: An Incremental Improvement

    Joseph Redmon. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

  44. [47]

    Yolo9000: better, faster, stronger

    Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017. 1, 2, 6

  45. [48]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019.

  46. [49]

    Efficient attention: Attention with linear complexities

    Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021. 3, 4

  47. [50]

    Fast-itpn: Integrally pre-trained transformer pyramid network with token migration

    Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-itpn: Integrally pre-trained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 3

  48. [51]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021. 6

  49. [52]

    Going deeper with image transformers

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 32–42, 2021. 4

  50. [53]

    Yolov10: Real-time end-to-end object detection

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024. 1, 2, 5, 6, 7, 8, 9, 10, 11

  52. [55]

    Gold-yolo: Efficient object detector via gather-and-distribute mechanism

    Chengcheng Wang, Wei He, Ying Nie, Jianyuan Guo, Chuanjian Liu, Yunhe Wang, and Kai Han. Gold-yolo: Efficient object detector via gather-and-distribute mechanism. Advances in Neural Information Processing Systems, 36, 2023.

  53. [56]

    Cspnet: A new backbone that can enhance learning capability of cnn

    Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 390–391, 2020.

  54. [57]

    Designing network design strategies through gradient path analysis

    Chien-Yao Wang, Hong-Yuan Mark Liao, and I-Hau Yeh. Designing network design strategies through gradient path analysis. arXiv preprint arXiv:2211.04800, 2022. 2, 4

  55. [58]

    Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

    Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7464–7475, 2023. 1, 2, 4, 6, 11

  56. [59]

    Yolov9: Learning what you want to learn using programmable gradient information

    Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024. 1, 2, 4, 5, 6, 7, 8, 9, 11

  57. [60]

    End-to-end object detection with fully convolutional network

    Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15849–15858, 2021. 1

  58. [61]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020. 3, 4

  59. [62]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021. 2

  60. [63]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025. 3

  61. [64]

    Nyströmformer: A Nyström-based algorithm for approximating self-attention

    Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14138–14148, 2021. 4

  62. [65]

    Glance-and-gaze vision transformer

    Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan L Yuille, and Wei Shen. Glance-and-gaze vision transformer. Advances in Neural Information Processing Systems, 34:12992–13003, 2021. 3

  63. [66]

    mixup: Beyond Empirical Risk Minimization

    Hongyi Zhang. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017. 11

  64. [67]

    Detrs beat yolos on real-time object detection

    Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024. 2, 5, 6

  65. [68]

    Distance-iou loss: Faster and better learning for bounding box regression

    Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-iou loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI conference on artificial intelligence, pages 12993–13000, 2020. 1

  66. [69]

    Iou loss for 2d/3d object detection

    Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d object detection. In 2019 international conference on 3D vision (3DV), pages 85–94. IEEE, 2019. 1

  67. [70]

    Autoassign: Differentiable label assignment for dense object detection

    Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020. 1

  68. [71]

    Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024. 3

  69. [72]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020. 2, 11