Recognition: 2 Lean theorem links
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Pith reviewed 2026-05-11 02:45 UTC · model grok-4.3
The pith
MobileNets use depth-wise separable convolutions and two global scaling hyperparameters to build lightweight networks that trade off accuracy against latency on mobile devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileNets are built from depth-wise separable convolutions that split each standard convolution into a per-channel spatial filter followed by a 1x1 point-wise combination, producing far fewer operations. The architecture adds two global hyperparameters: a width multiplier that uniformly reduces the number of channels across layers, and a resolution multiplier that shrinks the input image size. These parameters let a single base design generate a range of models that match the latency budgets of different mobile hardware while keeping enough capacity for high accuracy on vision tasks.
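The operation count behind this claim can be sketched numerically. Below is a minimal illustration using the paper's standard cost expressions for a D_K x D_K convolution over an M-channel, D_F x D_F feature map producing N output channels (the function names and the example layer sizes are ours, chosen to resemble an interior MobileNet layer):

```python
def standard_conv_mults(d_k: int, m: int, n: int, d_f: int) -> int:
    # Standard convolution: D_K * D_K * M * N * D_F * D_F multiply-adds.
    return d_k * d_k * m * n * d_f * d_f

def separable_conv_mults(d_k: int, m: int, n: int, d_f: int) -> int:
    # Depthwise stage (D_K^2 * M * D_F^2) plus pointwise stage (M * N * D_F^2).
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

# A representative interior layer: 3x3 kernels, 512 channels in and out,
# 14x14 feature map.
std = standard_conv_mults(3, 512, 512, 14)
sep = separable_conv_mults(3, 512, 512, 14)
ratio = sep / std  # algebraically equals 1/N + 1/D_K^2
```

The ratio reduces to 1/N + 1/D_K^2, so with 3x3 kernels the separable form needs roughly 8-9x fewer multiply-adds, which is the order of saving the paper reports.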
What carries the argument
Depth-wise separable convolutions that separate spatial and channel operations, combined with uniform width and resolution multipliers for global scaling.
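As a concrete sketch of that separation, here is a naive NumPy implementation of one depthwise separable block with valid padding and stride 1 (shapes and names are illustrative, not the paper's released code, and it omits the batch-norm and ReLU the paper inserts after each stage):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution (valid padding, stride 1).

    x          : (M, H, W) input feature map with M channels
    dw_kernels : (M, K, K) one spatial filter per input channel
    pw_kernels : (N, M)    1x1 pointwise filters that mix channels
    returns    : (N, H-K+1, W-K+1)
    """
    M, H, W = x.shape
    _, K, _ = dw_kernels.shape
    Ho, Wo = H - K + 1, W - K + 1

    # Depthwise stage: each channel is filtered independently.
    dw_out = np.zeros((M, Ho, Wo))
    for m in range(M):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[m, i, j] = np.sum(x[m, i:i+K, j:j+K] * dw_kernels[m])

    # Pointwise stage: a 1x1 convolution combines channels at every position,
    # expressed as a matrix product (N, M) @ (M, Ho*Wo).
    pw_out = pw_kernels @ dw_out.reshape(M, -1)
    return pw_out.reshape(-1, Ho, Wo)
```

With all-ones inputs and filters (M=2, K=3, N=3) every depthwise output is 9 and every pointwise output is 18, which makes the two-stage factorization easy to verify by hand.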
If this is right
- A single architecture family can generate models sized to fit specific hardware limits and accuracy needs.
- The networks achieve strong accuracy-latency tradeoffs on ImageNet classification relative to other common models.
- The same models transfer effectively to object detection, fine-grained classification, face attribute prediction, and large-scale geo-localization.
- Model builders can select the right size using only the two global hyperparameters instead of redesigning the network.
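The effect of those two hyperparameters on cost can be sketched the same way. This follows the paper's scaled cost expression, with the width multiplier alpha thinning channels and the resolution multiplier rho shrinking the feature map (a sketch under those definitions, not the released implementation):

```python
def scaled_separable_mults(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    # Width multiplier alpha scales channel counts M and N; resolution
    # multiplier rho scales the feature map side D_F. Cost becomes
    # D_K^2 * (alpha*M) * (rho*D_F)^2 + (alpha*M) * (alpha*N) * (rho*D_F)^2.
    m_a, n_a, d_f_r = alpha * m, alpha * n, rho * d_f
    return d_k * d_k * m_a * d_f_r ** 2 + m_a * n_a * d_f_r ** 2

base = scaled_separable_mults(3, 512, 512, 14)
half = scaled_separable_mults(3, 512, 512, 14, alpha=0.5, rho=0.5)
# The depthwise term scales as alpha * rho^2 and the dominant pointwise term
# as alpha^2 * rho^2, so alpha = rho = 0.5 cuts cost by roughly 8-16x
# without redesigning the architecture.
```

This is why a single base design can be swept across latency budgets: the whole family is indexed by just (alpha, rho).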
Where Pith is reading between the lines
- The uniform scaling approach could be tested on other convolution-based architectures to see if similar efficiency gains appear without full redesign.
- Real-time vision pipelines on edge hardware become more practical when accuracy can be dialed to match available compute.
- Combining these separable blocks with input-dependent scaling might further reduce average latency on varied data.
Load-bearing premise
Depth-wise separable convolutions plus uniform width and resolution scaling preserve enough representational power across the full range of target tasks and hardware constraints without needing task-specific redesign.
What would settle it
Showing that a MobileNet variant with a chosen width and resolution multiplier falls well below its predicted ImageNet accuracy, or fails to produce usable object-detection results on a mobile device, would disprove the claim that the scaling method works across constraints.
read the original abstract
We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MobileNets, a family of lightweight convolutional neural networks designed for mobile and embedded vision applications. The architecture relies on depth-wise separable convolutions and introduces two global hyperparameters (width multiplier and resolution multiplier) that allow trading off between model accuracy and computational efficiency (latency and size). The authors provide extensive empirical evaluation on ImageNet classification and show the models' utility in several downstream tasks such as object detection, fine-grained classification, face attribute classification, and geo-localization.
Significance. If the central claims hold, this work has high significance for the field of efficient deep learning. It demonstrates that a simple architectural choice combined with straightforward scaling rules can produce models that achieve good accuracy-latency trade-offs across a range of vision applications. The transparent presentation of results on held-out data and the applicability to multiple tasks without per-task redesign are strengths that could influence subsequent research on mobile-optimized networks.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the work's significance, and recommendation to accept the manuscript. No major comments were provided for us to address.
Circularity Check
No significant circularity identified
full rationale
The MobileNets architecture is defined directly via depthwise separable convolutions (a pre-existing factorization) plus two explicit user-selectable scalar multipliers for width and resolution. All reported results consist of measured top-1 accuracy, multiply-add counts, and latency on held-out ImageNet validation plus transfer tasks; the multipliers are not fitted inside the reported experiments but chosen by the model builder. No equation or claim reduces by construction to its own inputs, no uniqueness theorem is invoked, and no self-citation chain carries the central empirical demonstration.
Axiom & Free-Parameter Ledger
free parameters (2)
- width multiplier alpha
- resolution multiplier rho
axioms (2)
- domain assumption: Depthwise separable convolutions preserve sufficient feature quality for the target vision tasks when applied uniformly across layers.
- domain assumption: Standard ImageNet training (SGD, data augmentation, etc.) produces representative accuracy numbers for mobile deployment.
Lean theorems connected to this paper
- Cost.FunctionalEquation.washburn_uniqueness_aczel · relevance unclear · matched abstract text: "MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy."
- Foundation.DimensionForcing.dimension_forced · relevance unclear · matched abstract text: "We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification."
Forward citations
Cited by 47 Pith papers
- KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
  KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
- Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
  Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.
- Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
  A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.
- Searching for Activation Functions
  Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
- TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
  TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS...
- GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution
  GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 202...
- Hardware-Aware Neural Feature Extraction for Resource-Constrained Devices
  Gideon is a hardware-aware feature extractor using distillation and DNAS that achieves 111 fps on STM32N6 under 1.5 MB memory with negligible INT8 quantization loss.
- EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures
  EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.
- Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Unified and Generalized Approach
  Blind omnidirectional image quality assessment reduces to standard 2D blind IQA by skipping viewport generation, yielding a unified model that accepts equirectangular inputs directly.
- H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
  H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
- Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition
  A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.
- DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
  DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4....
- CODO: An Automated Compiler for Comprehensive Dataflow Optimization
  CODO automates comprehensive dataflow optimization on FPGAs, achieving 1.45x-4.52x speedups on kernels and up to 33.8x on DNN models over state-of-the-art frameworks.
- YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks
  YMIR dataset and YMCM CNN achieve 98.8% accuracy classifying five Yemeni music genres from audio features.
- FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
  FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
- TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification
  TREA is a low-precision time-multiplexed edge accelerator using dual-precision SIMD MAC units, structured pruning, and reconfigurable activation cores to deliver up to 9x kernel-level latency reduction for object dete...
- Lightning Unified Video Editing via In-Context Sparse Attention
  ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
- Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions
  Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...
- Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera
  A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.
- DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
  DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.
- Towards Topology-Aware Very Large-Scale Photonic AI Accelerators
  Photonic accelerators hit a topology-driven Utilization Wall; symmetric grids improve utilization up to 6X and cut memory access over 40% versus linear layouts.
- Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net
  A lightweight LLIE framework pairs frozen distribution-normalizing preprocessing with a compact depthwise U-Net to deliver competitive perceptual quality using far fewer parameters than prior methods.
- Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria
  Arbitrary heterogeneous fan-in profiles in sparse networks match uniform random accuracy at high sparsity, but initializing RigL dynamic sparse training with equilibrium-matched lognormal profiles improves performance...
- End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables
  An end-to-end hardware-aware optimization pipeline produces DNNs for PPG-based blood pressure estimation with up to 7.99% lower error and 83x fewer parameters that fit on ultra-low-power SoCs like GAP8.
- AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
  AsymLoc uses teacher-student distillation with geometry-driven matching to enable efficient nearest-neighbor feature matching, achieving 95% of teacher accuracy with 10x smaller models on localization benchmarks.
- DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network
  DAT-CFTNet combines attention-based dual-path RNN with CFTNet to improve speech enhancement and intelligibility for cochlear implant recipients in noisy conditions.
- YOLOv4: Optimal Speed and Accuracy of Object Detection
  YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
- Smart Railway Obstruction Detection System using IoT and Computer Vision
  NETRA integrates PIR and ultrasonic sensors with edge AI on Raspberry Pi to achieve 95% intrusion detection accuracy and zero false alarms at 75% lower cost than existing optical fiber systems.
- Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
  A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
- Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation
  A lightweight hybrid CNN-Transformer framework for heterogeneous face recognition achieves competitive performance on cross-spectral benchmarks and standard RGB tasks using contrastive alignment and distillation.
- FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation
  FoR-Net improves efficiency in semantic segmentation by focusing on hard regions with a learned selector and multi-scale convolutions, achieving competitive results on Cityscapes.
- EdgeLPR: On the Deep Neural Network trade-off between Precision and Performance in LiDAR Place Recognition
  FP16 quantization preserves accuracy in BEV-based LiDAR place recognition at lower cost while INT8 degradation depends on the network architecture.
- A Light Weight Multi-Features-View Convolution Neural Network For Plant Disease Identification
  A lightweight multi-features-view CNN achieves 2.9% higher accuracy than a standard RGB CNN on the PlantVillage dataset while remaining less computationally expensive.
- CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments
  Warp-tiled CUDA kernel for depthwise convolution delivers 3.26x runtime reduction versus naive baseline and 1.29x end-to-end training speedup using counter-free analysis in cloud settings.
- TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing
  TetrisG-SDK delivers 1.2-1.3x speedups and up to 70% EDAP reduction on CNN models by using adaptive windows for multi-macro parallelism and grouped convolutions on CIM hardware.
- Knowledge Distillation for Lightweight Multimodal Sensing-Aided mmWave Beam Tracking
  Knowledge distillation creates a lightweight student model that reaches over 96% top-5 beam prediction accuracy on real multimodal sensor data while using 27 times fewer parameters than the teacher.
- Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks
  Adaptive Data Dropout uses performance feedback to dynamically modulate training data exposure, reducing effective steps while matching static dropout accuracy on image benchmarks.
- A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study
  PD36-C is a 1.25 million parameter CNN achieving 99.53% average test accuracy on 38 plant disease classes from the New Plant Diseases Dataset, with a Qt-based app enabling edge deployment.
- FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition
  FaceLiVTv2 improves the accuracy-efficiency trade-off for mobile face recognition, cutting inference latency by 22% versus its predecessor while outperforming other lightweight models on standard benchmarks.
- Digital Image Forgery Detection Using Transfer Learning
  A hybrid RGB plus compression-feature transfer learning pipeline with Youden-optimized thresholds improves forgery detection on the CASIA v2.0 dataset using off-the-shelf CNN backbones.
- Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
  A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.
- Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation
  RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.
- Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles
  A threshold-based lane detector with perspective warp and histogram curvature plus EfficientNet-B0 achieves 3.16% max lane offset RMSE and 90% on-device sign accuracy while running real-time on resource-limited hardware.
- Real-Time Cellist Postural Evaluation With On-Device Computer Vision
  Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
- MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering
  MS-SSE-Net integrates multi-scale feature extraction and squeeze-and-excitation attention into DenseNet201, reaching 99.26% accuracy on the StructDamage dataset and outperforming the baseline by about 0.73 percentage points.
- A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
  Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
- NTIRE 2026 Challenge on Efficient Low Light Image Enhancement: Methods and Results
  The NTIRE 2026 E-LLIE challenge evaluated 27 lightweight models for low-light image enhancement and reported advances in balancing quality with mobile efficiency.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3]
- [4] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
- [5] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
- [6] J. Hays and A. Efros. IM2GPS: estimating geographic information from a single image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
- [7] J. Hays and A. Efros. Large-Scale Image Geolocalization. In J. Choi and G. Friedland, editors, Multimodal Location Estimation of Videos and Images. Springer, 2014.
- [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
- [9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [10]
- [11]
- [12]
- [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [14] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
- [15]
- [16]
- [17]
- [18]
- [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [20] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
- [21]
- [22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
- [23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
- [25] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
- [26] L. Sifre. Rigid-motion scattering for image classification. PhD thesis, 2014.
- [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [28] V. Sindhwani, T. Sainath, and S. Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, pages 3088–3096.
- [29] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
- [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- [31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
- [32]
- [33] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).
- [34]
- [35]
- [36]
- [37] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1483, 2015.