YOLOv4: Optimal Speed and Accuracy of Object Detection
Pith reviewed 2026-05-12 14:29 UTC · model grok-4.3
The pith
Combining eight features, including Mish activation and Mosaic augmentation, yields a detector that reaches 43.5 percent AP on MS COCO at roughly 65 frames per second on a Tesla V100 GPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that a specific collection of features—Weighted-Residual-Connections, Cross-Stage-Partial-connections, Cross mini-Batch Normalization, Self-adversarial-training, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss—can be combined inside the YOLO framework to reach 43.5 percent AP and 65.7 percent AP50 on the MS COCO dataset at approximately 65 frames per second on a Tesla V100 GPU.
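For readers unfamiliar with Mish, one of the eight listed features, it is commonly defined as mish(x) = x * tanh(softplus(x)). The sketch below is a minimal scalar illustration in plain Python, not the Darknet implementation; the helper names `softplus` and `mish` are chosen here for illustration only.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x))."""
    # max(x, 0) + log1p(exp(-|x|)) avoids overflow for large positive x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Print a few sample values to show the smooth, non-monotonic shape.
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, the function is smooth and lets small negative values pass through, while behaving almost like the identity for large positive inputs.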
What carries the argument
The YOLOv4 detector formed by attaching the eight listed features to a CSPDarknet53 backbone, PANet feature pyramid, and YOLOv3-style detection head.
If this is right
- Real-time detectors can now operate at accuracy levels previously available only from slower models.
- The same feature set can be inserted into other one-stage detectors to obtain similar speed-accuracy trade-offs.
- Mosaic augmentation and CIoU loss together improve training stability and final bounding-box precision.
- DropBlock regularization and self-adversarial training raise generalization with negligible extra cost at inference time.
- Empirical testing of feature combinations on large benchmarks can outperform purely theoretical design choices.
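The CIoU loss mentioned in the list above extends the plain IoU loss with a center-distance penalty and an aspect-ratio consistency term: loss = 1 - IoU + rho^2 / c^2 + alpha * v. The following is a minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples; the function name `ciou_loss` and the `eps` guard are illustrative assumptions, not the authors' code.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Complete-IoU loss for two axis-aligned boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU from the intersection and union areas.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the two box centers.
    rho2 = ((px1 + px2) - (tx1 + tx2)) ** 2 / 4 + ((py1 + py2) - (ty1 + ty2)) ** 2 / 4

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # v measures aspect-ratio mismatch; alpha weights it against the IoU term.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Because the distance term stays non-zero for disjoint boxes, the loss still gives a useful signal when the IoU itself is zero, which is one plausible reason it helps bounding-box regression.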
Where Pith is reading between the lines
- Future object-detection papers may treat the YOLOv4 feature set as a new baseline rather than starting from earlier YOLO versions.
- The universal features could transfer to related tasks such as instance segmentation or video object tracking.
- On lower-power hardware the reported speed margin may allow models to run at higher input resolutions than previously practical.
- Training pipelines that adopt Mosaic and CIoU may reduce the need for extensive hyper-parameter searches.
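Mosaic augmentation, as described in the YOLOv4 paper, mixes four training images into a single training sample. The sketch below is a deliberately simplified illustration: it tiles four equally sized images into a fixed 2x2 grid and shifts their boxes, whereas real implementations typically also randomize the split point and rescale or crop each tile. The function name `mosaic_2x2` and the nested-list image format are assumptions made for this example.

```python
def mosaic_2x2(images, boxes_per_image):
    """Tile four equally sized images into a 2x2 canvas and shift their boxes.

    images: four images, each a list of H rows holding W pixel values.
    boxes_per_image: four lists of (x1, y1, x2, y2, class_id) pixel boxes.
    """
    if len(images) != 4 or len(boxes_per_image) != 4:
        raise ValueError("mosaic_2x2 expects exactly four images and four box lists")
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images must share the same height and width")

    # Top half is image 0 next to image 1; bottom half is image 2 next to image 3.
    top = [images[0][r] + images[1][r] for r in range(height)]
    bottom = [images[2][r] + images[3][r] for r in range(height)]
    canvas = top + bottom

    # Shift each image's boxes by the offset of its tile on the canvas.
    offsets = [(0, 0), (width, 0), (0, height), (width, height)]
    merged = []
    for (dx, dy), boxes in zip(offsets, boxes_per_image):
        for x1, y1, x2, y2, class_id in boxes:
            merged.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, class_id))
    return canvas, merged


if __name__ == "__main__":
    # Four tiny 2x3 "images", each filled with its own index, one box apiece.
    imgs = [[[k] * 3 for _ in range(2)] for k in range(4)]
    boxes = [[(0, 0, 1, 1, k)] for k in range(4)]
    canvas, merged = mosaic_2x2(imgs, boxes)
    for row in canvas:
        print(row)
    print(merged)
```

Each composite sample exposes the detector to objects from four different contexts at once, which is the property the original paper associates with reduced dependence on large mini-batches.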
Load-bearing premise
The chosen features will combine without harmful interactions and will deliver comparable gains on other large-scale detection datasets.
What would settle it
A controlled experiment on the Open Images dataset that applies the same feature set and training schedule but records less than a 3-point AP gain relative to the prior YOLOv3 baseline at equivalent speed.
Original abstract
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes YOLOv4, an object detection model that integrates several features—Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss—into the YOLO framework. It reports state-of-the-art results of 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of ~65 FPS on Tesla V100, with the source code made available.
Significance. If the reported results hold under the provided ablations and code, this work offers a significant practical advance in real-time object detection by demonstrating an effective, reproducible combination of architectural and training techniques that improves the speed-accuracy trade-off over prior YOLO versions on a large-scale benchmark.
minor comments (2)
- Abstract: the list of new features repeats 'CmBN' twice, which appears to be a typographical error.
- Abstract: the text states that the authors 'combine some of them' to achieve the final result but does not explicitly identify the exact subset used for the reported 43.5% AP model; the body of the paper should make this mapping unambiguous.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work and for recommending minor revision. The review correctly identifies the core contribution as an effective, reproducible combination of techniques that advances the speed-accuracy trade-off for real-time detection on MS COCO. We have prepared a revised manuscript that incorporates all minor suggestions implied by the recommendation.
Circularity Check
No significant circularity
The manuscript is an empirical engineering paper that enumerates a set of architectural and training modifications (WRC, CSP, CmBN, SAT, Mish, Mosaic, DropBlock, CIoU) and reports their measured effect on MS COCO AP via ablation tables. No equations, fitted parameters, or uniqueness theorems are presented whose outputs are definitionally identical to their inputs. The central numeric claim (43.5% AP at ~65 FPS) is therefore an experimental observation rather than a self-referential derivation; the linked repository and incremental ablations render the result externally checkable.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters and feature-selection choices
axioms (1)
- domain assumption: Features such as batch-normalization and residual connections are universal across models, tasks, and datasets
Forward citations
Cited by 33 Pith papers
- Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
  Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
- SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion
  The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.
- Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery
  ASAHI adaptively slices high-res images into 6 or 12 patches, adds slicing-assisted fine-tuning, and uses Cluster-DIoU-NMS to hit 56.8% mAP on VisDrone2019 and 22.7% on xView while running 20-25% faster than fixed sli...
- Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
  HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
- WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
  WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
- AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
  A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain ...
- Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
  TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
- Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles
  Simulated coordinated IR and LiDAR spoofing achieves 85.5% success deceiving MSF perception on 400 KITTI scenes by creating consistent false 3D objects.
- FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts
  FlowExtract extracts directed graphs from ISO 5807 flowcharts via YOLOv8 node detection and arrowhead-based edge tracing, outperforming vision-language models on connectivity reconstruction.
- ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse
  ComPrivDet detects privacy objects in compressed videos by reusing I-frame inferences and skipping over 80% of detections while maintaining over 96% accuracy.
- SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection
  SFFNet uses multi-scale dynamic dual-domain coupling and a synergistic feature pyramid network to reach 36.8 AP on VisDrone and 20.6 AP on UAVDT for UAV object detection.
- YOLOv12: Attention-Centric Real-Time Object Detectors
  YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
- Inner Monologue: Embodied Reasoning through Planning with Language Models
  LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
- DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
  DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
- YOLOX: Exceeding YOLO Series in 2021
  YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
- DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
  DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
- Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
  XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
- RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
  RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
- LiDAR-based Crowd Navigation with Visible Edge Group Representation
  A simplified visible edge group representation enables robot crowd navigation that matches prior methods in safety and socialness while running faster in dense settings.
- CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting
  CollideNet achieves state-of-the-art time-to-collision forecasting on three public datasets by combining multi-scale spatial aggregation with temporal disentanglement of trend and seasonality in a hierarchical transformer.
- Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
  A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
- SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
  A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authen...
- Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
  YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
- Learning to count small and clustered objects with application to bacterial colonies
  ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
- Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach
  A Mosaic-plus-HSV data augmentation method improves mAP for small UAV detection on lightweight models across four datasets and offers better precision-stability balance under foggy conditions than Copy-Paste or MixUp.
- The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results
  The NTIRE 2026 CD-FSOD Challenge report details innovative methods and performance results from 19 teams on cross-domain few-shot object detection in open- and closed-source tracks.
- Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
  MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
- Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems
  An enhanced YOLOv8 model with Ghost Module, CBAM, and DCNv2 achieves 95.4% mAP@0.5 on the KITTI dataset for vehicle detection, an 8.97% gain over the baseline.
- Semantic-Fast-SAM: Efficient Semantic Segmenter
  Semantic-Fast-SAM matches prior SAM-based semantic segmentation accuracy on Cityscapes and ADE20K while running about 20 times faster by combining FastSAM with SSA labeling and CLIP for open-vocabulary cases.
- Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection
  A YOLOv11-based desktop application detects and counts vehicles in traffic videos with 67-96% accuracy and high F1 scores for cars and trucks.
- Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
  A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
- YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
  YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.
- YOLOv11: An Overview of the Key Architectural Enhancements
  YOLOv11 adds blocks such as C3k2, SPPF, and C2PSA to improve feature extraction, mAP, and efficiency while supporting detection, segmentation, pose, and oriented detection across model sizes.
Reference graph
Works this paper leans on
-
[1]
Soft-NMS–improving object detection with one line of code
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5561–5569,
-
[2]
Cascade R-CNN: Delving into high quality object detection
Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6154–6162, 2018. 12
work page 2018
-
[3]
Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hi- erarchical shot detector. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 9705–9714, 2019. 12
work page 2019
-
[4]
Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. HarDNet: A low memory traf- fic network.Proceedings of the IEEE International Confer- ence on Computer Vision (ICCV), 2019. 13
work page 2019
-
[5]
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic im- age segmentation with deep convolutional nets, atrous con- volution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , 40(4):834–848, 2017. 2, 4
work page 2017
-
[6]
Pengguang Chen. GridMask data augmentation. arXiv preprint arXiv:2001.04086, 2020. 3
-
[7]
DetNAS: Backbone search for object detection
Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In Advances in Neural Information Pro- cessing Systems (NeurIPS), pages 6638–6648, 2019. 2
work page 2019
-
[8]
Jiwoong Choi, Dayoung Chun, Hyun Kim, and Hyuk-Jae Lee. Gaussian YOLOv3: An accurate and fast object de- tector using localization uncertainty for autonomous driv- ing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 502–511, 2019. 7
work page 2019
-
[9]
R-FCN: Object detection via region-based fully convolutional net- works
Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional net- works. In Advances in Neural Information Processing Sys- tems (NIPS), pages 379–387, 2016. 2
work page 2016
-
[10]
ImageNet: A large-scale hierarchical im- age database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical im- age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 5
work page 2009
-
[11]
Improved Regularization of Convolutional Neural Networks with Cutout
Terrance DeVries and Graham W Taylor. Improved reg- ularization of convolutional neural networks with CutOut. arXiv preprint arXiv:1708.04552, 2017. 3
work page internal anchor Pith review arXiv 2017
-
[12]
SpineNet: Learning scale-permuted backbone for recog- nition and localization
Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recog- nition and localization. arXiv preprint arXiv:1912.05027,
-
[13]
CenterNet: Keypoint triplets for object detection
Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qing- ming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6569–6578,
-
[14]
RetinaMask: Learning to predict masks improves state- of-the-art single-shot detection for free
Cheng-Yang Fu, Mykhailo Shvets, and Alexander C Berg. RetinaMask: Learning to predict masks improves state- of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353, 2019. 12
-
[15]
Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained cnns are biased towards texture; increas- ing shape bias improves accuracy and robustness. In Inter- national Conference on Learning Representations (ICLR) ,
-
[16]
DropBlock: A regularization method for convolutional networks
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. DropBlock: A regularization method for convolutional networks. InAd- vances in Neural Information Processing Systems (NIPS) , pages 10727–10737, 2018. 3
work page 2018
-
[17]
NAS-FPN: Learning scalable feature pyramid architecture for object detection
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7036– 7045, 2019. 2, 13
work page 2019
-
[18]
Ross Girshick. Fast R-CNN. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 1440–1448, 2015. 2
work page 2015
-
[19]
Rich feature hierarchies for accurate object de- tection and semantic segmentation
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object de- tection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 580–587, 2014. 2, 4
work page 2014
-
[20]
Hit- Detector: Hierarchical trinity architecture search for object detection
Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhao- hui Yang, Han Wu, Xinghao Chen, and Chang Xu. Hit- Detector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2020. 2
work page 2020
-
[21]
GhostNet: More features from cheap operations
Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2020. 5
work page 2020
-
[22]
Hypercolumns for object segmentation and fine-grained localization
Bharath Hariharan, Pablo Arbel ´aez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015. 4
work page 2015
-
[23]
Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask R-CNN. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 2961–2969, 2017. 2
work page 2017
-
[24]
Delving deep into rectifiers: Surpassing human-level per- formance on ImageNet classification
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level per- formance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015. 4
work page 2015
-
[25]
Spatial pyramid pooling in deep convolutional networks for visual recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analy- sis and Machine Intelligence (TPAMI) , 37(9):1904–1916,
work page 1904
-
[26]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- 14 ings of the IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 770–778, 2016. 2
work page 2016
-
[27]
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for Mo- bileNetV3. In Proceedings of the IEEE International Con- ference on Computer Vision (ICCV), 2019. 2, 4
work page 2019
-
[28]
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An- dreetto, and Hartwig Adam. MobileNets: Efficient con- volutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[29]
Squeeze-and-excitation networks
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7132– 7141, 2018. 4
work page 2018
-
[30]
Densely connected convolutional net- works
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil- ian Q Weinberger. Densely connected convolutional net- works. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 4700– 4708, 2017. 2
work page 2017
-
[31]
Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer pa- rameters and¡ 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016. 2
-
[32]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal co- variate shift. arXiv preprint arXiv:1502.03167, 2015. 6
work page internal anchor Pith review arXiv 2015
-
[33]
Label refinement network for coarse-to-fine semantic segmentation
Md Amirul Islam, Shujon Naha, Mrigank Rochan, Neil Bruce, and Yang Wang. Label refinement network for coarse-to-fine semantic segmentation. arXiv preprint arXiv:1703.00551, 2017. 3
-
[34]
Parallel feature pyra- mid network for object detection
Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyra- mid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 234–250, 2018. 11
work page 2018
-
[35]
Self-normalizing neural networks
G ¨unter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017. 4
work page 2017
-
[36]
FractalNet: Ultra-deep neural net- works without residuals
Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural net- works without residuals. arXiv preprint arXiv:1605.07648,
-
[37]
CornerNet: Detecting objects as paired keypoints
Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 734–750, 2018. 2, 11
work page 2018
-
[38]
CornerNet-Lite: Efficient keypoint based object detection
Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng. CornerNet-Lite: Efficient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019. 2
-
[39]
Be- yond bags of features: Spatial pyramid matching for recog- nizing natural scene categories
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Be- yond bags of features: Spatial pyramid matching for recog- nizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178. IEEE, 2006. 4
work page 2006
-
[40]
CenterMask: Real-time anchor-free instance segmentation
Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), 2020. 12, 13
work page 2020
-
[41]
Dynamic anchor feature selection for single-shot object detection
Shuai Li, Lingxiao Yang, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Dynamic anchor feature selection for single-shot object detection. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 6609–6618, 2019. 12
work page 2019
-
[42]
Scale-aware trident networks for object detection
Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6054–6063, 2019. 12
work page 2019
-
[43]
DetNet: Design backbone for object detection
Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yang- dong Deng, and Jian Sun. DetNet: Design backbone for object detection. In Proceedings of the European Confer- ence on Computer Vision (ECCV) , pages 334–350, 2018. 2
work page 2018
-
[44]
Feature pyramid networks for object detection
Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 2
work page 2017
-
[45]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll ´ar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Com- puter Vision (ICCV), pages 2980–2988, 2017. 2, 3, 11, 13
work page 2017
-
[46]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014. 5
work page 2014
-
[47]
Receptive field block net for accurate and fast object detection
Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 385–400, 2018. 2, 4, 11
work page 2018
-
[48]
Learning spa- tial fusion for single-shot object detection
Songtao Liu, Di Huang, and Yunhong Wang. Learning spa- tial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019. 2, 4, 13
-
[49]
Path aggregation network for instance segmentation
Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768, 2018. 1, 2, 7
work page 2018
-
[50]
SSD: Single shot multibox detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 21–37, 2016. 2, 11
work page 2016
-
[51]
Fully convolutional networks for semantic segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015. 4
work page 2015
-
[52]
SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov and Frank Hutter. SGDR: Stochas- tic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 7
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[53]
ShuffleNetV2: Practical guidelines for efficient cnn 15 architecture design
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNetV2: Practical guidelines for efficient cnn 15 architecture design. In Proceedings of the European Con- ference on Computer Vision (ECCV), pages 116–131, 2018. 2
work page 2018
-
[54]
Rec- tifier nonlinearities improve neural network acoustic mod- els
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rec- tifier nonlinearities improve neural network acoustic mod- els. In Proceedings of International Conference on Ma- chine Learning (ICML), volume 30, page 3, 2013. 4
work page 2013
-
[55]
Diganta Misra. Mish: A self regularized non- monotonic neural activation function. arXiv preprint arXiv:1908.08681, 2019. 4
-
[56]
Rectified linear units improve restricted boltzmann machines
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of International Conference on Machine Learning (ICML), pages 807–814, 2010. 4
work page 2010
-
[57]
Enriched feature guided refinement network for object detection
Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fa- had Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In Proceedings of the IEEE International Conference on Com- puter Vision (ICCV), pages 9537–9546, 2019. 12
work page 2019
-
[58]
Libra R-CNN: Towards bal- anced learning for object detection
Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards bal- anced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 821–830, 2019. 2, 12
work page 2019
-
[59]
Searching for Activation Functions
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 4
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[60]
Matrix Nets: A new deep architecture for object detection
Abdullah Rashwan, Agastya Kalra, and Pascal Poupart. Matrix Nets: A new deep architecture for object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCV Workshop), pages 0–0,
-
[61]
You only look once: Unified, real-time object detection
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016. 2
work page 2016
-
[62]
YOLO9000: better, faster, stronger
Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7263–7271, 2017. 2
work page 2017
-
[63]
YOLOv3: An Incremental Improvement
Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 2, 4, 7, 11
work page internal anchor Pith review arXiv 2018
-
[64]
Faster R-CNN: Towards real-time object detection with region proposal networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015. 2
work page 2015
-
[65]
Generalized intersection over union: A metric and a loss for bounding box regression
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019. 3
work page 2019
-
[66]
MobileNetV2: Inverted residuals and linear bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. 2
work page 2018
-
[67]
Training region-based object detectors with online hard example mining
Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016. 3
work page 2016
-
[68]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[69]
Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond
Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, and Yong Jae Lee. Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv preprint arXiv:1811.02545, 2018. 3
-
[70]
Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737, 2019. 6
-
[71]
DropOut: A simple way to prevent neural networks from overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. DropOut: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. 3
work page 2014
-
[72]
Example-based learning for view-based human face detection
K-K Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(1):39–51, 1998. 3
work page 1998
-
[73]
Rethinking the inception architecture for computer vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 3
work page 2016
-
[74]
MNASnet: Platform-aware neural architecture search for mobile
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MNASnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2820–2828, 2019. 2
work page 2019
-
[75]
EfficientNet: Rethinking model scaling for convolutional neural networks
Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019. 2
work page 2019
-
[76]
MixNet: Mixed depthwise convolutional kernels
Mingxing Tan and Quoc V Le. MixNet: Mixed depthwise convolutional kernels. In Proceedings of the British Machine Vision Conference (BMVC), 2019. 5
work page 2019
-
[77]
EfficientDet: Scalable and efficient object detection
Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 4, 13
work page 2020
-
[78]
FCOS: Fully convolutional one-stage object detection
Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9627–9636, 2019. 2
work page 2019
-
[79]
Efficient object localization using convolutional networks
Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015. 6
work page 2015
-
[80]
Regularization of neural networks using DropConnect
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of International Conference on Machine Learning (ICML), pages 1058–1066, 2013. 3
work page 2013