arxiv: 2004.10934 · v1 · submitted 2020-04-23 · 💻 cs.CV · eess.IV

YOLOv4: Optimal Speed and Accuracy of Object Detection

Alexey Bochkovskiy , Chien-Yao Wang , Hong-Yuan Mark Liao

Pith reviewed 2026-05-12 14:29 UTC · model grok-4.3

keywords object detection · YOLO · real-time inference · convolutional neural networks · data augmentation · activation functions · bounding box regression · MS COCO dataset

The pith

Combining eight features, including Mish activation and Mosaic augmentation, yields a detector that reaches 43.5 percent AP on MS COCO at roughly 65 frames per second on a Tesla V100 GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines many proposed improvements to convolutional networks for object detection and selects those that appear to work across different models and large datasets. It integrates Weighted-Residual-Connections, Cross-Stage-Partial connections, Cross mini-Batch Normalization, Self-adversarial training, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss into a single architecture called YOLOv4. This produces state-of-the-art accuracy on the MS COCO benchmark while preserving real-time inference speed on current GPUs. A sympathetic reader would care because object detection systems must deliver both high precision and low latency to be useful in video streams, autonomous systems, and other time-sensitive settings.

Core claim

The authors demonstrate that a specific collection of features—Weighted-Residual-Connections, Cross-Stage-Partial-connections, Cross mini-Batch Normalization, Self-adversarial-training, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss—can be combined inside the YOLO framework to reach 43.5 percent AP and 65.7 percent AP50 on the MS COCO dataset at approximately 65 frames per second on a Tesla V100 GPU.
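Of the listed features, Mish activation is the simplest to state on its own. The sketch below uses the standard definition, mish(x) = x · tanh(softplus(x)), on plain Python floats. It is an illustration, not the darknet implementation; the helper name `softplus` and the overflow-safe rewrite are choices made here.

```python
import math


def softplus(x: float) -> float:
    """Softplus, log(1 + exp(x)), written to avoid overflow for large |x|."""
    # Identity used: softplus(x) = max(x, 0) + log1p(exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, the function is smooth and lets small negative values through, which is the property usually cited when it is swapped in for other activations.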

What carries the argument

The YOLOv4 detector, formed by attaching the eight listed features to a CSPDarknet53 backbone, an SPP block with a PANet path-aggregation neck, and a YOLOv3-style detection head.

If this is right

Real-time detectors can now operate at accuracy levels previously available only from slower models.
The same feature set can be inserted into other one-stage detectors to obtain similar speed-accuracy trade-offs.
Mosaic augmentation and CIoU loss together improve training stability and final bounding-box precision.
DropBlock regularization and self-adversarial training raise generalization with negligible extra cost at inference time.
Empirical testing of feature combinations on large benchmarks can outperform purely theoretical design choices.
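The bounding-box claim above rests on CIoU loss. As a rough sketch of what that loss computes, the function below follows the commonly published form, 1 − IoU + ρ²/c² + αv, for two axis-aligned boxes given as `(x1, y1, x2, y2)`. The function name, the box format, and the `eps` guard are assumptions made for illustration; this is not taken from the paper's code.

```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """Complete-IoU loss between two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centres.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

The centre-distance term keeps the loss informative even when the boxes do not overlap, where plain IoU is zero regardless of how far apart they are.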

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future object-detection papers may treat the YOLOv4 feature set as a new baseline rather than starting from earlier YOLO versions.
The universal features could transfer to related tasks such as instance segmentation or video object tracking.
On lower-power hardware the reported speed margin may allow models to run at higher input resolutions than previously practical.
Training pipelines that adopt Mosaic and CIoU may reduce the need for extensive hyper-parameter searches.

Load-bearing premise

The chosen features will combine without harmful interactions and will deliver comparable gains on other large-scale detection datasets.

What would settle it

A controlled experiment on the Open Images dataset that applies the same feature set and training schedule but records less than a 3-point AP gain relative to the prior YOLOv3 baseline at equivalent speed.
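The test above reduces to a simple decision rule, sketched here. The function name, the 5 percent tolerance used to decide whether two speeds count as "equivalent", and the example numbers are all hypothetical; only the 3-point AP threshold comes from the text.

```python
def falsification_verdict(baseline_ap, candidate_ap, baseline_fps, candidate_fps,
                          min_gain=3.0, fps_tolerance=0.05):
    """Classify an experiment against the 3-point AP criterion.

    Returns "inconclusive" when the two speeds differ by more than
    fps_tolerance (relative), "refuted" when the AP gain is below min_gain,
    and "not refuted" otherwise.
    """
    relative_speed_gap = abs(candidate_fps - baseline_fps) / baseline_fps
    if relative_speed_gap > fps_tolerance:
        return "inconclusive"
    gain = candidate_ap - baseline_ap
    return "refuted" if gain < min_gain else "not refuted"


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(falsification_verdict(40.0, 42.0, 60.0, 61.0))
    print(falsification_verdict(40.0, 44.5, 60.0, 61.0))
```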

Original abstract

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

YOLOv4 is a careful engineering combination of existing components that delivers a verifiable 43.5% AP at 65 FPS on COCO, backed by ablations and released code.

The core result is that a YOLOv3-derived model with CSP connections, Mish activation, CIoU loss, Mosaic augmentation, and a handful of other tweaks reaches 43.5% AP (65.7% AP50) at roughly 65 FPS on MS COCO using a Tesla V100. The paper treats this as the outcome of testing which features are broadly useful rather than model-specific, and it supplies incremental ablation tables that isolate each addition's effect along with the exact training recipe in the linked darknet repository. That combination of numbers, tables, and code makes the performance claim checkable instead of resting on an untested assumption that the tricks will always transfer.
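Mosaic augmentation, mentioned in the letter, mixes four training images into one. The sketch below shows only the core bookkeeping under strong simplifying assumptions: four equally sized images (nested lists of pixel values) are tiled in a fixed 2×2 grid and their boxes are shifted to match. Actual implementations typically also randomize the split point and crop or rescale the tiles; none of that is modelled here, and the function name is hypothetical.

```python
def mosaic_2x2(images, boxes):
    """Tile four equally sized images into one canvas and shift their boxes.

    images: list of four images, each a list of rows (lists of pixel values).
    boxes:  list of four lists of (x1, y1, x2, y2) boxes in pixel coordinates.
    Returns (canvas, merged_boxes) with the canvas twice as wide and tall.
    """
    if len(images) != 4 or len(boxes) != 4:
        raise ValueError("expected exactly four images and four box lists")
    h, w = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all images must share the same height and width")

    # Quadrant order: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]

    top = [images[0][r] + images[1][r] for r in range(h)]
    bottom = [images[2][r] + images[3][r] for r in range(h)]
    canvas = top + bottom

    merged_boxes = []
    for (dx, dy), image_boxes in zip(offsets, boxes):
        for x1, y1, x2, y2 in image_boxes:
            merged_boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, merged_boxes
```

Each training sample then carries objects from four different contexts, so a single batch exposes the detector to more varied scenes than four separate images of the same size would.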

Referee Report

0 major / 2 minor

Summary. The manuscript describes YOLOv4, an object detection model that integrates several features—Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss—into the YOLO framework. It reports state-of-the-art results of 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of ~65 FPS on Tesla V100, with the source code made available.

Significance. If the reported results hold under the provided ablations and code, this work offers a significant practical advance in real-time object detection by demonstrating an effective, reproducible combination of architectural and training techniques that improves the speed-accuracy trade-off over prior YOLO versions on a large-scale benchmark.

minor comments (2)

Abstract: the list of new features repeats 'CmBN' twice, which appears to be a typographical error.
Abstract: the text states that the authors 'combine some of them' to achieve the final result but does not explicitly identify the exact subset used for the reported 43.5% AP model; the body of the paper should make this mapping unambiguous.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the work and for recommending minor revision. The review correctly identifies the core contribution as an effective, reproducible combination of techniques that advances the speed-accuracy trade-off for real-time detection on MS COCO. We have prepared a revised manuscript that incorporates all minor suggestions implied by the recommendation.

Circularity Check

0 steps flagged

No significant circularity

The manuscript is an empirical engineering paper that enumerates a set of architectural and training modifications (WRC, CSP, CmBN, SAT, Mish, Mosaic, DropBlock, CIoU) and reports their measured effect on MS COCO AP via ablation tables. No equations, fitted parameters, or uniqueness theorems are presented whose outputs are definitionally identical to their inputs. The central numeric claim (43.5% AP at ~65 FPS) is therefore an experimental observation rather than a self-referential derivation; the linked repository and incremental ablations render the result externally checkable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that a particular set of modular improvements can be combined without destructive interference on the COCO benchmark; no new physical or mathematical axioms are introduced.

free parameters (1)

Training hyperparameters and feature-selection choices
Standard ML practice; the abstract does not enumerate the exact values or search procedure used to arrive at the final combination.

axioms (1)

domain assumption: Features such as batch-normalization and residual connections are universal across models, tasks, and datasets
Explicitly stated in the abstract as the basis for selecting WRC, CSP, CmBN, SAT, and Mish.

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
cs.LG 2026-05 unverdicted novelty 8.0

Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion
cs.CR 2026-04 unverdicted novelty 7.0

The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.
Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery
cs.CV 2026-04 unverdicted novelty 7.0

ASAHI adaptively slices high-res images into 6 or 12 patches, adds slicing-assisted fine-tuning, and uses Cluster-DIoU-NMS to hit 56.8% mAP on VisDrone2019 and 22.7% on xView while running 20-25% faster than fixed sli...
Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
cs.CV 2026-04 unverdicted novelty 7.0

HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
cs.CV 2026-04 unverdicted novelty 7.0

WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
cs.CV 2026-05 unverdicted novelty 6.0

A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain ...
Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
cs.CV 2026-04 unverdicted novelty 6.0

TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles
cs.CR 2026-04 unverdicted novelty 6.0

Simulated coordinated IR and LiDAR spoofing achieves 85.5% success deceiving MSF perception on 400 KITTI scenes by creating consistent false 3D objects.
FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts
cs.CV 2026-04 unverdicted novelty 6.0

FlowExtract extracts directed graphs from ISO 5807 flowcharts via YOLOv8 node detection and arrowhead-based edge tracing, outperforming vision-language models on connectivity reconstruction.
ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse
cs.CV 2026-04 unverdicted novelty 6.0

ComPrivDet detects privacy objects in compressed videos by reusing I-frame inferences and skipping over 80% of detections while maintaining over 96% accuracy.
SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

SFFNet uses multi-scale dynamic dual-domain coupling and a synergistic feature pyramid network to reach 36.8 AP on VisDrone and 20.6 AP on UAVDT for UAV object detection.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
Inner Monologue: Embodied Reasoning through Planning with Language Models
cs.RO 2022-07 unverdicted novelty 6.0

LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
cs.CV 2022-03 conditional novelty 6.0

DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
YOLOX: Exceeding YOLO Series in 2021
cs.CV 2021-07 accept novelty 6.0

YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
cs.CV 2026-04 unverdicted novelty 5.0

XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
cs.CV 2026-04 unverdicted novelty 5.0

RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
LiDAR-based Crowd Navigation with Visible Edge Group Representation
cs.RO 2026-04 unverdicted novelty 5.0

A simplified visible edge group representation enables robot crowd navigation that matches prior methods in safety and socialness while running faster in dense settings.
CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting
cs.CV 2026-04 unverdicted novelty 5.0

CollideNet achieves state-of-the-art time-to-collision forecasting on three public datasets by combining multi-scale spatial aggregation with temporal disentanglement of trend and seasonality in a hierarchical transformer.
Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
cs.CV 2026-05 unverdicted novelty 4.0

A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
cs.CV 2026-04 unverdicted novelty 4.0

A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authen...
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
cs.CV 2026-04 unverdicted novelty 4.0

YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
Learning to count small and clustered objects with application to bacterial colonies
cs.CV 2026-04 unverdicted novelty 4.0

ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach
cs.CV 2026-04 unverdicted novelty 4.0

A Mosaic-plus-HSV data augmentation method improves mAP for small UAV detection on lightweight models across four datasets and offers better precision-stability balance under foggy conditions than Copy-Paste or MixUp.
The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results
cs.CV 2026-04 unverdicted novelty 4.0

The NTIRE 2026 CD-FSOD Challenge report details innovative methods and performance results from 19 teams on cross-domain few-shot object detection in open- and closed-source tracks.
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
cs.CV 2026-04 unverdicted novelty 4.0

MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems
cs.CV 2026-04 unverdicted novelty 3.0

An enhanced YOLOv8 model with Ghost Module, CBAM, and DCNv2 achieves 95.4% mAP@0.5 on the KITTI dataset for vehicle detection, an 8.97% gain over the baseline.
Semantic-Fast-SAM: Efficient Semantic Segmenter
cs.CV 2026-04 unverdicted novelty 3.0

Semantic-Fast-SAM matches prior SAM-based semantic segmentation accuracy on Cityscapes and ADE20K while running about 20 times faster by combining FastSAM with SSA labeling and CLIP for open-vocabulary cases.
Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection
cs.CV 2026-04 unverdicted novelty 3.0

A YOLOv11-based desktop application detects and counts vehicles in traffic videos with 67-96% accuracy and high F1 scores for cars and trucks.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
eess.SY 2026-04 unverdicted novelty 2.0

A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
cs.CV 2026-04 unverdicted novelty 2.0

YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.
YOLOv11: An Overview of the Key Architectural Enhancements
cs.CV 2024-10 unverdicted novelty 1.0

YOLOv11 adds blocks such as C3k2, SPPF, and C2PSA to improve feature extraction, mAP, and efficiency while supporting detection, segmentation, pose, and oriented detection across model sizes.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 33 Pith papers · 8 internal anchors

[1]

Soft-NMS–improving object detection with one line of code

Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5561–5569,

work page
[2]

Cascade R-CNN: Delving into high quality object detection

Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6154–6162, 2018. 12

work page 2018
[3]

Hi- erarchical shot detector

Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hi- erarchical shot detector. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 9705–9714, 2019. 12

work page 2019
[4]

HarDNet: A low memory traf- ﬁc network.Proceedings of the IEEE International Confer- ence on Computer Vision (ICCV), 2019

Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. HarDNet: A low memory traf- ﬁc network.Proceedings of the IEEE International Confer- ence on Computer Vision (ICCV), 2019. 13

work page 2019
[5]

DeepLab: Semantic im- age segmentation with deep convolutional nets, atrous con- volution, and fully connected CRFs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic im- age segmentation with deep convolutional nets, atrous con- volution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , 40(4):834–848, 2017. 2, 4

work page 2017
[6]

GridMask data augmentation

Pengguang Chen. GridMask data augmentation. arXiv preprint arXiv:2001.04086, 2020. 3

work page arXiv 2001
[7]

DetNAS: Backbone search for object detection

Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In Advances in Neural Information Pro- cessing Systems (NeurIPS), pages 6638–6648, 2019. 2

work page 2019
[8]

Gaussian YOLOv3: An accurate and fast object de- tector using localization uncertainty for autonomous driv- ing

Jiwoong Choi, Dayoung Chun, Hyun Kim, and Hyuk-Jae Lee. Gaussian YOLOv3: An accurate and fast object de- tector using localization uncertainty for autonomous driv- ing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 502–511, 2019. 7

work page 2019
[9]

R-FCN: Object detection via region-based fully convolutional net- works

Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional net- works. In Advances in Neural Information Processing Sys- tems (NIPS), pages 379–387, 2016. 2

work page 2016
[10]

ImageNet: A large-scale hierarchical im- age database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical im- age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 5

work page 2009
[11]

Improved Regularization of Convolutional Neural Networks with Cutout

Terrance DeVries and Graham W Taylor. Improved reg- ularization of convolutional neural networks with CutOut. arXiv preprint arXiv:1708.04552, 2017. 3

work page internal anchor Pith review arXiv 2017
[12]

SpineNet: Learning scale-permuted backbone for recog- nition and localization

Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recog- nition and localization. arXiv preprint arXiv:1912.05027,

work page arXiv 1912
[13]

CenterNet: Keypoint triplets for object detection

Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qing- ming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6569–6578,

work page
[14]

RetinaMask: Learning to predict masks improves state- of-the-art single-shot detection for free

Cheng-Yang Fu, Mykhailo Shvets, and Alexander C Berg. RetinaMask: Learning to predict masks improves state- of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353, 2019. 12

work page arXiv 1901
[15]

ImageNet-trained cnns are biased towards texture; increas- ing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained cnns are biased towards texture; increas- ing shape bias improves accuracy and robustness. In Inter- national Conference on Learning Representations (ICLR) ,

work page
[16]

DropBlock: A regularization method for convolutional networks

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. DropBlock: A regularization method for convolutional networks. InAd- vances in Neural Information Processing Systems (NIPS) , pages 10727–10737, 2018. 3

work page 2018
[17]

NAS-FPN: Learning scalable feature pyramid architecture for object detection

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7036– 7045, 2019. 2, 13

work page 2019
[18]

Fast R-CNN

Ross Girshick. Fast R-CNN. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 1440–1448, 2015. 2

work page 2015
[19]

Rich feature hierarchies for accurate object de- tection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object de- tection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 580–587, 2014. 2, 4

work page 2014
[20]

Hit- Detector: Hierarchical trinity architecture search for object detection

Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhao- hui Yang, Han Wu, Xinghao Chen, and Chang Xu. Hit- Detector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2020. 2

work page 2020
[21]

GhostNet: More features from cheap operations

Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2020. 5

work page 2020
[22]

Hypercolumns for object segmentation and ﬁne-grained localization

Bharath Hariharan, Pablo Arbel ´aez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and ﬁne-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015. 4

work page 2015
[23]

Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask R-CNN. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV), pages 2961–2969, 2017. 2

work page 2017
[24]

Delving deep into rectiﬁers: Surpassing human-level per- formance on ImageNet classiﬁcation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectiﬁers: Surpassing human-level per- formance on ImageNet classiﬁcation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015. 4

work page 2015
[25]

Spatial pyramid pooling in deep convolutional networks for visual recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analy- sis and Machine Intelligence (TPAMI) , 37(9):1904–1916,

work page 1904
[26]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- 14 ings of the IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 770–778, 2016. 2

work page 2016
[27]

Searching for Mo- bileNetV3

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for Mo- bileNetV3. In Proceedings of the IEEE International Con- ference on Computer Vision (ICCV), 2019. 2, 4

work page 2019
[28]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An- dreetto, and Hartwig Adam. MobileNets: Efﬁcient con- volutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2017
[29]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 7132– 7141, 2018. 4

work page 2018
[30]

Densely connected convolutional net- works

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil- ian Q Weinberger. Densely connected convolutional net- works. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 4700– 4708, 2017. 2

work page 2017
[31]

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.arXiv2016, arXiv:1602.07360

Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer pa- rameters and¡ 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016. 2

work page arXiv 2016
[32]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal co- variate shift. arXiv preprint arXiv:1502.03167, 2015. 6

work page internal anchor Pith review arXiv 2015
[33]

Label reﬁnement network for coarse-to-ﬁne semantic segmentation

Md Amirul Islam, Shujon Naha, Mrigank Rochan, Neil Bruce, and Yang Wang. Label reﬁnement network for coarse-to-ﬁne semantic segmentation. arXiv preprint arXiv:1703.00551, 2017. 3

work page arXiv 2017
[34]

Parallel feature pyra- mid network for object detection

Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyra- mid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 234–250, 2018. 11

work page 2018
[35]

Self-normalizing neural networks

G ¨unter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017. 4

work page 2017
[36]

FractalNet: Ultra-deep neural net- works without residuals

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural net- works without residuals. arXiv preprint arXiv:1605.07648,

work page arXiv
[37]

CornerNet: Detecting objects as paired keypoints

Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 734–750, 2018. 2, 11

work page 2018
[38]

CornerNet-Lite: Efﬁcient keypoint based object detection

Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng. CornerNet-Lite: Efﬁcient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019. 2

work page arXiv 1904
[39]

Be- yond bags of features: Spatial pyramid matching for recog- nizing natural scene categories

Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Be- yond bags of features: Spatial pyramid matching for recog- nizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178. IEEE, 2006. 4

work page 2006
[40]

CenterMask: Real-time anchor-free instance segmentation

Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), 2020. 12, 13

work page 2020
[41]

Dynamic anchor feature selection for single-shot object detection

Shuai Li, Lingxiao Yang, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Dynamic anchor feature selection for single-shot object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6609–6618, 2019. 12

work page 2019
[42]

Scale-aware trident networks for object detection

Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6054–6063, 2019. 12

work page 2019
[43]

DetNet: Design backbone for object detection

Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–350, 2018. 2

work page 2018
[44]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 2

work page 2017
[45]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 2, 3, 11, 13

work page 2017
[46]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014. 5

work page 2014
[47]

Receptive field block net for accurate and fast object detection

Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018. 2, 4, 11

work page 2018
[48]

Learning spatial fusion for single-shot object detection

Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019. 2, 4, 13

work page arXiv 1911
[49]

Path aggregation network for instance segmentation

Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768, 2018. 1, 2, 7

work page 2018
[50]

SSD: Single shot multibox detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016. 2, 11

work page 2016
[51]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015. 4

work page 2015
[52]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 7

work page internal anchor Pith review Pith/arXiv arXiv 2016
[53]

ShuffleNetV2: Practical guidelines for efficient CNN architecture design

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNetV2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 2

work page 2018
[54]

Rectifier nonlinearities improve neural network acoustic models

Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of International Conference on Machine Learning (ICML), volume 30, page 3, 2013. 4

work page 2013
[55]

Mish: A self regularized non-monotonic neural activation function

Diganta Misra. Mish: A self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681, 2019. 4

work page arXiv 1908
[56]

Rectified linear units improve restricted Boltzmann machines

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of International Conference on Machine Learning (ICML), pages 807–814, 2010. 4

work page 2010
[57]

Enriched feature guided refinement network for object detection

Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9537–9546, 2019. 12

work page 2019
[58]

Libra R-CNN: Towards balanced learning for object detection

Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 821–830, 2019. 2, 12

work page 2019
[59]

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 4

work page internal anchor Pith review Pith/arXiv arXiv 2017
[60]

Matrix Nets: A new deep architecture for object detection

Abdullah Rashwan, Agastya Kalra, and Pascal Poupart. Matrix Nets: A new deep architecture for object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCV Workshop), pages 0–0,

work page
[61]

You only look once: Unified, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016. 2

work page 2016
[62]

YOLO9000: better, faster, stronger

Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7263–7271, 2017. 2

work page 2017
[63]

YOLOv3: An Incremental Improvement

Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 2, 4, 7, 11

work page internal anchor Pith review arXiv 2018
[64]

Faster R-CNN: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015. 2

work page 2015
[65]

Generalized intersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019. 3

work page 2019
[66]

MobileNetV2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. 2

work page 2018
[67]

Training region-based object detectors with online hard example mining

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016. 3

work page 2016
[68]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2

work page internal anchor Pith review Pith/arXiv arXiv 2014
[69]

Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond

Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, and Yong Jae Lee. Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv preprint arXiv:1811.02545, 2018. 3

work page arXiv 2018
[70]

Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks

Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737, 2019. 6

work page arXiv 1911
[71]

DropOut: A simple way to prevent neural networks from overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. DropOut: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014. 3

work page 2014
[72]

Example-based learning for view-based human face detection

K-K Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(1):39–51, 1998. 3

work page 1998
[73]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 3

work page 2016
[74]

MNASnet: Platform-aware neural architecture search for mobile

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MNASnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2820–2828, 2019. 2

work page 2019
[75]

EfficientNet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019. 2

work page 2019
[76]

MixNet: Mixed depthwise convolutional kernels

Mingxing Tan and Quoc V Le. MixNet: Mixed depthwise convolutional kernels. In Proceedings of the British Machine Vision Conference (BMVC), 2019. 5

work page 2019
[77]

EfficientDet: Scalable and efficient object detection

Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 4, 13

work page 2020
[78]

FCOS: Fully convolutional one-stage object detection

Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9627–9636, 2019. 2

work page 2019
[79]

Efficient object localization using convolutional networks

Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015. 6

work page 2015
[80]

Regularization of neural networks using DropConnect

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of International Conference on Machine Learning (ICML), pages 1058–1066, 2013. 3

work page 2013

Showing first 80 references.