pith. machine review for the scientific record.

arxiv: 2004.10934 · v1 · submitted 2020-04-23 · 💻 cs.CV · eess.IV

YOLOv4: Optimal Speed and Accuracy of Object Detection

Pith reviewed 2026-05-12 14:29 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords object detection · YOLO · real-time inference · convolutional neural networks · data augmentation · activation functions · bounding box regression · MS COCO dataset

The pith

Combining eight features, including Mish activation and Mosaic augmentation, yields a detector with 43.5 percent AP on MS COCO at roughly 65 frames per second on a Tesla V100.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines many proposed improvements to convolutional networks for object detection and selects those that appear to work across different models and large datasets. It integrates Weighted-Residual-Connections, Cross-Stage-Partial connections, Cross mini-Batch Normalization, Self-adversarial training, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss into a single architecture called YOLOv4. This produces state-of-the-art accuracy on the MS COCO benchmark while preserving real-time inference speed on current GPUs. A sympathetic reader would care because object detection systems must deliver both high precision and low latency to be useful in video streams, autonomous systems, and other time-sensitive settings.
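Of the listed features, Mish is the easiest to state concretely. The sketch below assumes the standard definition from the Mish paper, mish(x) = x * tanh(softplus(x)). The function names `softplus` and `mish` are illustrative and are not taken from the YOLOv4 source code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x))."""
    # For positive x, rewrite as x + log(1 + exp(-x)) so exp() cannot overflow.
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, the function is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.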

Core claim

The authors demonstrate that a specific collection of features—Weighted-Residual-Connections, Cross-Stage-Partial-connections, Cross mini-Batch Normalization, Self-adversarial-training, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss—can be combined inside the YOLO framework to reach 43.5 percent AP and 65.7 percent AP50 on the MS COCO dataset at approximately 65 frames per second on a Tesla V100 GPU.
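As a rough illustration of the CIoU loss named in the claim, the sketch below follows the usual formulation: one minus IoU, plus the squared center distance normalized by the squared diagonal of the smallest enclosing box, plus a weighted aspect-ratio consistency term. Boxes are assumed to be (x1, y1, x2, y2) corner tuples. The name `ciou_loss` and the `eps` guard are hypothetical choices for this example, not the darknet implementation.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Complete-IoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU from the intersection and union areas.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((px1 + px2) - (tx1 + tx2)) ** 2 / 4 + ((py1 + py2) - (ty1 + ty2)) ** 2 / 4
    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Because the distance term stays non-zero for disjoint boxes, the loss still provides a training signal when the prediction and the ground truth do not overlap, which plain IoU loss does not.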

What carries the argument

The YOLOv4 detector itself: a CSPDarknet53 backbone, an SPP block and PANet path-aggregation neck, and a YOLOv3 detection head, built and trained with the eight listed features.

If this is right

  • Real-time detectors can now operate at accuracy levels previously available only from slower models.
  • The same feature set can be inserted into other one-stage detectors to obtain similar speed-accuracy trade-offs.
  • Mosaic augmentation and CIoU loss together improve training stability and final bounding-box precision.
  • DropBlock regularization and self-adversarial training raise generalization with negligible extra cost at inference time.
  • Empirical testing of feature combinations on large benchmarks can outperform purely theoretical design choices.
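The inference-cost point for DropBlock follows from how the regularizer works: it zeroes contiguous square regions of a feature map during training and is simply skipped at inference. A minimal sketch, assuming a single-channel feature map stored as nested lists and the seed rate gamma from the DropBlock paper. The name `dropblock` and its parameters are illustrative, not the YOLOv4 code.

```python
import random


def dropblock(feature, block_size, drop_prob, rng=None):
    """Training-time DropBlock on one 2D feature map (list of rows)."""
    rng = rng or random.Random()
    h, w = len(feature), len(feature[0])
    assert 1 <= block_size <= min(h, w)

    # Blocks may only start where they fit entirely inside the map.
    valid_h, valid_w = h - block_size + 1, w - block_size + 1
    # Seed probability chosen so that roughly drop_prob of the units are dropped.
    gamma = (drop_prob / block_size ** 2) * (h * w) / (valid_h * valid_w)

    mask = [[1] * w for _ in range(h)]
    for top in range(valid_h):
        for left in range(valid_w):
            if rng.random() < gamma:
                # Zero a block_size x block_size square starting at this seed.
                for y in range(top, top + block_size):
                    for x in range(left, left + block_size):
                        mask[y][x] = 0

    # Rescale the surviving units so the expected activation sum is unchanged.
    kept = sum(map(sum, mask))
    scale = (h * w) / kept if kept else 0.0
    return [[feature[y][x] * mask[y][x] * scale for x in range(w)] for y in range(h)]
```

At inference the function is not called at all, so the deployed network pays nothing for it.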

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future object-detection papers may treat the YOLOv4 feature set as a new baseline rather than starting from earlier YOLO versions.
  • The universal features could transfer to related tasks such as instance segmentation or video object tracking.
  • On lower-power hardware the reported speed margin may allow models to run at higher input resolutions than previously practical.
  • Training pipelines that adopt Mosaic and CIoU may reduce the need for extensive hyper-parameter searches.

Load-bearing premise

The chosen features will combine without harmful interactions and will deliver comparable gains on other large-scale detection datasets.

What would settle it

A controlled experiment on the Open Images dataset that applies the same feature set and training schedule but records less than a 3-point AP gain relative to the prior YOLOv3 baseline at equivalent speed.

Original abstract

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet
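The abstract names Mosaic data augmentation without describing it. In the body of the paper it is introduced as mixing four training images into one. The sketch below is a deliberately simplified version that assumes four equally sized grayscale images (nested lists) tiled into a fixed 2x2 grid, with boxes given as (x1, y1, x2, y2) pixel tuples. Practical implementations typically also randomize the split point and rescale or crop the tiles, which is omitted here. `mosaic` is a hypothetical helper name.

```python
def mosaic(images, boxes_per_image):
    """Tile four equally sized images into a 2x2 grid and shift their boxes."""
    assert len(images) == 4 and len(boxes_per_image) == 4
    h, w = len(images[0]), len(images[0][0])
    assert all(len(img) == h and all(len(row) == w for row in img) for img in images)

    # (row offset, column offset) per tile: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    merged_boxes = []

    for img, boxes, (dy, dx) in zip(images, boxes_per_image, offsets):
        # Copy the tile's pixels into its quadrant of the canvas.
        for y in range(h):
            canvas[dy + y][dx:dx + w] = img[y]
        # Move each box by the same offset as its tile.
        for x1, y1, x2, y2 in boxes:
            merged_boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))

    return canvas, merged_boxes
```

Each composite sample then shows objects from four different scenes, so a single mini-batch covers more varied context than the same number of unmixed images.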

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript describes YOLOv4, an object detection model that integrates several features—Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss—into the YOLO framework. It reports state-of-the-art results of 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of ~65 FPS on Tesla V100, with the source code made available.

Significance. If the reported results hold under the provided ablations and code, this work offers a significant practical advance in real-time object detection by demonstrating an effective, reproducible combination of architectural and training techniques that improves the speed-accuracy trade-off over prior YOLO versions on a large-scale benchmark.

minor comments (2)
  1. Abstract: the list of new features repeats 'CmBN' twice, which appears to be a typographical error.
  2. Abstract: the text states that the authors 'combine some of them' to achieve the final result but does not explicitly identify the exact subset used for the reported 43.5% AP model; the body of the paper should make this mapping unambiguous.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the work and for recommending minor revision. The review correctly identifies the core contribution as an effective, reproducible combination of techniques that advances the speed-accuracy trade-off for real-time detection on MS COCO. We have prepared a revised manuscript that incorporates all minor suggestions implied by the recommendation.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is an empirical engineering paper that enumerates a set of architectural and training modifications (WRC, CSP, CmBN, SAT, Mish, Mosaic, DropBlock, CIoU) and reports their measured effect on MS COCO AP via ablation tables. No equations, fitted parameters, or uniqueness theorems are presented whose outputs are definitionally identical to their inputs. The central numeric claim (43.5% AP at ~65 FPS) is therefore an experimental observation rather than a self-referential derivation; the linked repository and incremental ablations render the result externally checkable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that a particular set of modular improvements can be combined without destructive interference on the COCO benchmark; no new physical or mathematical axioms are introduced.

free parameters (1)
  • Training hyperparameters and feature-selection choices
    Standard ML practice; the abstract does not enumerate the exact values or search procedure used to arrive at the final combination.
axioms (1)
  • domain assumption: Features such as batch-normalization and residual connections are universal across models, tasks, and datasets
    Explicitly stated in the abstract as the basis for selecting WRC, CSP, CmBN, SAT, and Mish.

pith-pipeline@v0.9.0 · 5511 in / 1429 out tokens · 37413 ms · 2026-05-12T14:29:03.443017+00:00

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

    cs.LG 2026-05 unverdicted novelty 8.0

    Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.

  2. SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.

  3. Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

    cs.CV 2026-04 unverdicted novelty 7.0

    ASAHI adaptively slices high-res images into 6 or 12 patches, adds slicing-assisted fine-tuning, and uses Cluster-DIoU-NMS to hit 56.8% mAP on VisDrone2019 and 22.7% on xView while running 20-25% faster than fixed sli...

  4. Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.

  5. WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

    cs.CV 2026-04 unverdicted novelty 7.0

    WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

  6. AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

    cs.CV 2026-05 unverdicted novelty 6.0

    A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain ...

  7. Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models

    cs.CV 2026-04 unverdicted novelty 6.0

    TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.

  8. Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles

    cs.CR 2026-04 unverdicted novelty 6.0

    Simulated coordinated IR and LiDAR spoofing achieves 85.5% success deceiving MSF perception on 400 KITTI scenes by creating consistent false 3D objects.

  9. FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts

    cs.CV 2026-04 unverdicted novelty 6.0

    FlowExtract extracts directed graphs from ISO 5807 flowcharts via YOLOv8 node detection and arrowhead-based edge tracing, outperforming vision-language models on connectivity reconstruction.

  10. ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse

    cs.CV 2026-04 unverdicted novelty 6.0

    ComPrivDet detects privacy objects in compressed videos by reusing I-frame inferences and skipping over 80% of detections while maintaining over 96% accuracy.

  11. SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    SFFNet uses multi-scale dynamic dual-domain coupling and a synergistic feature pyramid network to reach 36.8 AP on VisDrone and 20.6 AP on UAVDT for UAV object detection.

  12. YOLOv12: Attention-Centric Real-Time Object Detectors

    cs.CV 2025-02 unverdicted novelty 6.0

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  13. Inner Monologue: Embodied Reasoning through Planning with Language Models

    cs.RO 2022-07 unverdicted novelty 6.0

    LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

  14. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    cs.CV 2022-03 conditional novelty 6.0

    DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

  15. YOLOX: Exceeding YOLO Series in 2021

    cs.CV 2021-07 accept novelty 6.0

    YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.

  16. DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.

  17. Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence

    cs.CV 2026-04 unverdicted novelty 5.0

    XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.

  18. RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery

    cs.CV 2026-04 unverdicted novelty 5.0

    RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.

  19. LiDAR-based Crowd Navigation with Visible Edge Group Representation

    cs.RO 2026-04 unverdicted novelty 5.0

    A simplified visible edge group representation enables robot crowd navigation that matches prior methods in safety and socialness while running faster in dense settings.

  20. CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

    cs.CV 2026-04 unverdicted novelty 5.0

    CollideNet achieves state-of-the-art time-to-collision forecasting on three public datasets by combining multi-scale spatial aggregation with temporal disentanglement of trend and seasonality in a hierarchical transformer.

  21. Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

    cs.CV 2026-05 unverdicted novelty 4.0

    A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.

  22. SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

    cs.CV 2026-04 unverdicted novelty 4.0

    A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authen...

  23. Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices

    cs.CV 2026-04 unverdicted novelty 4.0

    YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.

  24. Learning to count small and clustered objects with application to bacterial colonies

    cs.CV 2026-04 unverdicted novelty 4.0

    ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.

  25. Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach

    cs.CV 2026-04 unverdicted novelty 4.0

    A Mosaic-plus-HSV data augmentation method improves mAP for small UAV detection on lightweight models across four datasets and offers better precision-stability balance under foggy conditions than Copy-Paste or MixUp.

  26. The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

    cs.CV 2026-04 unverdicted novelty 4.0

    The NTIRE 2026 CD-FSOD Challenge report details innovative methods and performance results from 19 teams on cross-domain few-shot object detection in open- and closed-source tracks.

  27. Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.

  28. Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems

    cs.CV 2026-04 unverdicted novelty 3.0

    An enhanced YOLOv8 model with Ghost Module, CBAM, and DCNv2 achieves 95.4% mAP@0.5 on the KITTI dataset for vehicle detection, an 8.97% gain over the baseline.

  29. Semantic-Fast-SAM: Efficient Semantic Segmenter

    cs.CV 2026-04 unverdicted novelty 3.0

    Semantic-Fast-SAM matches prior SAM-based semantic segmentation accuracy on Cityscapes and ADE20K while running about 20 times faster by combining FastSAM with SSA labeling and CLIP for open-vocabulary cases.

  30. Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection

    cs.CV 2026-04 unverdicted novelty 3.0

    A YOLOv11-based desktop application detects and counts vehicles in traffic videos with 67-96% accuracy and high F1 scores for cars and trucks.

  31. Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems

    eess.SY 2026-04 unverdicted novelty 2.0

    A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

  32. YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection

    cs.CV 2026-04 unverdicted novelty 2.0

    YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.

  33. YOLOv11: An Overview of the Key Architectural Enhancements

    cs.CV 2024-10 unverdicted novelty 1.0

    YOLOv11 adds blocks such as C3k2, SPPF, and C2PSA to improve feature extraction, mAP, and efficiency while supporting detection, segmentation, pose, and oriented detection across model sizes.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 33 Pith papers · 8 internal anchors

  1. [1]

    Soft-NMS–improving object detection with one line of code

    Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5561–5569,

  2. [2]

    Cascade R-CNN: Delving into high quality object detection

    Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6154–6162, 2018. 12

  3. [3]

    Hi- erarchical shot detector

    Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hi- erarchical shot detector. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 9705–9714, 2019. 12

  4. [4]

    HarDNet: A low memory traf- fic network.Proceedings of the IEEE International Confer- ence on Computer Vision (ICCV), 2019

    Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. HarDNet: A low memory traf- fic network.Proceedings of the IEEE International Confer- ence on Computer Vision (ICCV), 2019. 13

  5. [5]

    DeepLab: Semantic im- age segmentation with deep convolutional nets, atrous con- volution, and fully connected CRFs

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic im- age segmentation with deep convolutional nets, atrous con- volution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , 40(4):834–848, 2017. 2, 4

  6. [6]

    GridMask data augmentation

    Pengguang Chen. GridMask data augmentation. arXiv preprint arXiv:2001.04086, 2020. 3

  7. [7]

    DetNAS: Backbone search for object detection

    Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In Advances in Neural Information Pro- cessing Systems (NeurIPS), pages 6638–6648, 2019. 2

  8. [8]

    Gaussian YOLOv3: An accurate and fast object de- tector using localization uncertainty for autonomous driv- ing

    Jiwoong Choi, Dayoung Chun, Hyun Kim, and Hyuk-Jae Lee. Gaussian YOLOv3: An accurate and fast object de- tector using localization uncertainty for autonomous driv- ing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 502–511, 2019. 7

  9. [9]

    R-FCN: Object detection via region-based fully convolutional net- works

    Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional net- works. In Advances in Neural Information Processing Sys- tems (NIPS), pages 379–387, 2016. 2

  10. [10]

    ImageNet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical im- age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 5

  11. [11]

    Improved Regularization of Convolutional Neural Networks with Cutout

    Terrance DeVries and Graham W Taylor. Improved reg- ularization of convolutional neural networks with CutOut. arXiv preprint arXiv:1708.04552, 2017. 3

  12. [12]

    SpineNet: Learning scale-permuted backbone for recog- nition and localization

    Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recog- nition and localization. arXiv preprint arXiv:1912.05027,

  13. [13]

    CenterNet: Keypoint triplets for object detection

    Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qing- ming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6569–6578,

  14. [14]

    RetinaMask: Learning to predict masks improves state- of-the-art single-shot detection for free

    Cheng-Yang Fu, Mykhailo Shvets, and Alexander C Berg. RetinaMask: Learning to predict masks improves state- of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353, 2019. 12

  15. [15]

    ImageNet-trained cnns are biased towards texture; increas- ing shape bias improves accuracy and robustness

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained cnns are biased towards texture; increas- ing shape bias improves accuracy and robustness. In Inter- national Conference on Learning Representations (ICLR) ,

  16. [16]

    DropBlock: A regularization method for convolutional networks

    Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. DropBlock: A regularization method for convolutional networks. InAd- vances in Neural Information Processing Systems (NIPS) , pages 10727–10737, 2018. 3

  17. [17]

    NAS-FPN: Learning scalable feature pyramid architecture for object detection

    Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7036– 7045, 2019. 2, 13

  18. [18]

    Fast R-CNN

    Ross Girshick. Fast R-CNN. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 1440–1448, 2015. 2

  19. [19]

    Rich feature hierarchies for accurate object de- tection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object de- tection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 580–587, 2014. 2, 4

  20. [20]

    Hit- Detector: Hierarchical trinity architecture search for object detection

    Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhao- hui Yang, Han Wu, Xinghao Chen, and Chang Xu. Hit- Detector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2020. 2

  21. [21]

    GhostNet: More features from cheap operations

    Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2020. 5

  22. [22]

    Hypercolumns for object segmentation and fine-grained localization

    Bharath Hariharan, Pablo Arbel ´aez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015. 4

  23. [23]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask R-CNN. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 2961–2969, 2017. 2

  24. [24]

    Delving deep into rectifiers: Surpassing human-level per- formance on ImageNet classification

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level per- formance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015. 4

  25. [25]

    Spatial pyramid pooling in deep convolutional networks for visual recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analy- sis and Machine Intelligence (TPAMI) , 37(9):1904–1916,

  26. [26]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- 14 ings of the IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 770–778, 2016. 2

  27. [27]

    Searching for Mo- bileNetV3

    Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for Mo- bileNetV3. In Proceedings of the IEEE International Con- ference on Computer Vision (ICCV), 2019. 2, 4

  28. [28]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An- dreetto, and Hartwig Adam. MobileNets: Efficient con- volutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2, 4

  29. [29]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7132– 7141, 2018. 4

  30. [30]

    Densely connected convolutional net- works

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil- ian Q Weinberger. Densely connected convolutional net- works. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 4700– 4708, 2017. 2

  31. [31]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.arXiv2016, arXiv:1602.07360

    Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer pa- rameters and¡ 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016. 2

  32. [32]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal co- variate shift. arXiv preprint arXiv:1502.03167, 2015. 6

  33. [33]

    Label refinement network for coarse-to-fine semantic segmentation

    Md Amirul Islam, Shujon Naha, Mrigank Rochan, Neil Bruce, and Yang Wang. Label refinement network for coarse-to-fine semantic segmentation. arXiv preprint arXiv:1703.00551, 2017. 3

  34. [34]

    Parallel feature pyra- mid network for object detection

    Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyra- mid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 234–250, 2018. 11

  35. [35]

    Self-normalizing neural networks

    G ¨unter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017. 4

  36. [36]

    FractalNet: Ultra-deep neural net- works without residuals

    Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural net- works without residuals. arXiv preprint arXiv:1605.07648,

  37. [37]

    CornerNet: Detecting objects as paired keypoints

    Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 734–750, 2018. 2, 11

  38. [38]

    CornerNet-Lite: Efficient keypoint based object detection

    Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng. CornerNet-Lite: Efficient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019. 2

  39. [39]

    Be- yond bags of features: Spatial pyramid matching for recog- nizing natural scene categories

    Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Be- yond bags of features: Spatial pyramid matching for recog- nizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178. IEEE, 2006. 4

  40. [40]

    CenterMask: Real-time anchor-free instance segmentation

    Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), 2020. 12, 13

  41. [41]

    Dynamic anchor feature selection for single-shot object detection

    Shuai Li, Lingxiao Yang, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Dynamic anchor feature selection for single-shot object detection. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 6609–6618, 2019. 12

  42. [42]

    Scale-aware trident networks for object detection

    Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6054–6063, 2019. 12

  43. [43]

    DetNet: Design backbone for object detection

    Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–350, 2018. 2

  44. [44]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 2

  45. [45]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 2, 3, 11, 13

  46. [46]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014. 5

  47. [47]

    Receptive field block net for accurate and fast object detection

    Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018. 2, 4, 11

  48. [48]

    Learning spatial fusion for single-shot object detection

    Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019. 2, 4, 13

  49. [49]

    Path aggregation network for instance segmentation

    Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768, 2018. 1, 2, 7

  50. [50]

    SSD: Single shot multibox detector

    Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016. 2, 11

  51. [51]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015. 4

  52. [52]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 7

  53. [53]

    ShuffleNetV2: Practical guidelines for efficient CNN architecture design

    Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNetV2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 2

  54. [54]

    Rectifier nonlinearities improve neural network acoustic models

    Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of International Conference on Machine Learning (ICML), volume 30, page 3, 2013. 4

  55. [55]

    Mish: A self regularized non-monotonic neural activation function

    Diganta Misra. Mish: A self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681, 2019. 4

  56. [56]

    Rectified linear units improve restricted boltzmann machines

    Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of International Conference on Machine Learning (ICML), pages 807–814, 2010. 4

  57. [57]

    Enriched feature guided refinement network for object detection

    Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9537–9546, 2019. 12

  58. [58]

    Libra R-CNN: Towards balanced learning for object detection

    Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 821–830, 2019. 2, 12

  59. [59]

    Searching for Activation Functions

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 4

  60. [60]

    Matrix Nets: A new deep architecture for object detection

    Abdullah Rashwan, Agastya Kalra, and Pascal Poupart. Matrix Nets: A new deep architecture for object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCV Workshop), pages 0–0,

  61. [61]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016. 2

  62. [62]

    YOLO9000: better, faster, stronger

    Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7263–7271, 2017. 2

  63. [63]

    YOLOv3: An Incremental Improvement

    Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 2, 4, 7, 11

  64. [64]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015. 2

  65. [65]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019. 3

  66. [66]

    MobileNetV2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. 2

  67. [67]

    Training region-based object detectors with online hard example mining

    Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016. 3

  68. [68]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2

  69. [69]

    Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond

    Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, and Yong Jae Lee. Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv preprint arXiv:1811.02545, 2018. 3

  70. [70]

    Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks

    Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737, 2019. 6

  71. [71]

    DropOut: A simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. DropOut: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. 3

  72. [72]

    Example-based learning for view-based human face detection

    K-K Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(1):39–51, 1998. 3

  73. [73]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 3

  74. [74]

    MNASnet: Platform-aware neural architecture search for mobile

    Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MNASnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2820–2828, 2019. 2

  75. [75]

    EfficientNet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019. 2

  76. [76]

    MixNet: Mixed depthwise convolutional kernels

    Mingxing Tan and Quoc V Le. MixNet: Mixed depthwise convolutional kernels. In Proceedings of the British Machine Vision Conference (BMVC), 2019. 5

  77. [77]

    EfficientDet: Scalable and efficient object detection

    Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 4, 13

  78. [78]

    FCOS: Fully convolutional one-stage object detection

    Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9627–9636, 2019. 2

  79. [79]

    Efficient object localization using convolutional networks

    Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015. 6

  80. [80]

    Regularization of neural networks using DropConnect

    Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of International Conference on Machine Learning (ICML), pages 1058–1066, 2013. 3

Showing first 80 references.