pith. machine review for the scientific record.

arxiv: 1704.04861 · v1 · submitted 2017-04-17 · 💻 cs.CV

Recognition: 2 theorem links

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G. Howard, Bo Chen, Dmitry Kalenichenko, Hartwig Adam, Marco Andreetto, Menglong Zhu, Tobias Weyand, Weijun Wang

Pith reviewed 2026-05-11 02:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords MobileNets · depth-wise separable convolutions · efficient convolutional networks · mobile vision applications · width and resolution multipliers · ImageNet classification · model scaling

The pith

MobileNets use depth-wise separable convolutions and two global scaling hyperparameters to build lightweight networks that trade off latency against accuracy on mobile devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MobileNets as a family of efficient convolutional neural networks designed specifically for mobile and embedded vision applications. These models replace standard convolutions with depth-wise separable operations that factor spatial filtering from channel mixing to reduce computation and parameters. Two simple hyperparameters, one for width and one for resolution, allow uniform scaling of the entire network to hit different speed and accuracy targets. Extensive tests on ImageNet show competitive accuracy for the given resources, and the same models transfer to object detection, fine-grained classification, face attributes, and geo-localization without custom redesign.

Core claim

MobileNets are built from depth-wise separable convolutions that split each standard convolution into a per-channel spatial filter followed by a 1x1 point-wise combination, producing far fewer operations. The architecture adds two global hyperparameters: a width multiplier that uniformly reduces the number of channels across layers, and a resolution multiplier that shrinks the input image size. These parameters let a single base design generate a range of models that match the latency budgets of different mobile hardware while keeping enough capacity for high accuracy on vision tasks.
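The saving from the factorization can be sketched with the paper's multiply-add accounting: a standard D_K×D_K convolution over M input and N output channels on a D_F×D_F feature map costs D_K²·M·N·D_F², while the depth-wise-plus-point-wise pair costs D_K²·M·D_F² + M·N·D_F², a reduction factor of 1/N + 1/D_K². A minimal sketch (the function names here are illustrative, not from the paper):

```python
# Multiply-add counts for one layer, following the paper's cost model.
# Symbols: k = kernel size D_K, m = input channels M, n = output channels N,
# f = spatial size D_F of the (square) output feature map.

def standard_conv_cost(k: int, m: int, n: int, f: int) -> int:
    """Standard convolution: every output channel mixes all input channels."""
    return k * k * m * n * f * f

def separable_conv_cost(k: int, m: int, n: int, f: int) -> int:
    """Depth-wise (per-channel k x k filter) plus point-wise (1x1) combination."""
    depthwise = k * k * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise

# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
std = standard_conv_cost(3, 512, 512, 14)
sep = separable_conv_cost(3, 512, 512, 14)
ratio = sep / std  # equals 1/n + 1/k^2 exactly
print(f"standard: {std:,}  separable: {sep:,}  ratio: {ratio:.4f}")
```

For 3x3 kernels the ratio is about 1/9 plus a small channel term, which is the roughly 8-9x reduction the paper leans on.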

What carries the argument

Depth-wise separable convolutions that separate spatial and channel operations, combined with uniform width and resolution multipliers for global scaling.
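The mechanics of that separation can be made concrete with a minimal NumPy sketch (stride 1, no padding, no bias or nonlinearity; the function name and shapes are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_weights):
    """Depth-wise separable convolution, 'valid' padding, stride 1.

    x:          (H, W, M) input feature map
    dw_filters: (k, k, M) one spatial filter per input channel (depth-wise step)
    pw_weights: (M, N) 1x1 point-wise weights that mix channels (point-wise step)
    returns:    (H-k+1, W-k+1, N)
    """
    h, w, m = x.shape
    k = dw_filters.shape[0]
    out_h, out_w = h - k + 1, w - k + 1

    # Depth-wise step: each channel is filtered independently; no channel mixing.
    dw = np.zeros((out_h, out_w, m))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]             # (k, k, M)
            dw[i, j, :] = np.sum(patch * dw_filters, axis=(0, 1))

    # Point-wise step: a 1x1 convolution mixes channels at each position.
    return dw @ pw_weights                             # (out_h, out_w, N)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
out = depthwise_separable_conv(x,
                               rng.standard_normal((3, 3, 4)),
                               rng.standard_normal((4, 6)))
print(out.shape)  # (6, 6, 6)
```

In MobileNets each of these two steps is additionally followed by batch normalization and a ReLU, which the sketch omits.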

If this is right

  • A single architecture family can generate models sized to fit specific hardware limits and accuracy needs.
  • The networks achieve strong accuracy-latency tradeoffs on ImageNet classification relative to other common models.
  • The same models transfer effectively to object detection, fine-grained classification, face attribute prediction, and large-scale geo-localization.
  • Model builders can select the right size using only the two global hyperparameters instead of redesigning the network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform scaling approach could be tested on other convolution-based architectures to see if similar efficiency gains appear without full redesign.
  • Real-time vision pipelines on edge hardware become more practical when accuracy can be dialed to match available compute.
  • Combining these separable blocks with input-dependent scaling might further reduce average latency on varied data.

Load-bearing premise

Depth-wise separable convolutions plus uniform width and resolution scaling preserve enough representational power across the full range of target tasks and hardware constraints without needing task-specific redesign.

What would settle it

Showing that a MobileNet variant with a chosen width and resolution multiplier falls well below its predicted ImageNet accuracy, or fails to produce usable results on a mobile device for object detection, would disprove the claim that the scaling method works across constraints.

read the original abstract

We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces MobileNets, a family of lightweight convolutional neural networks designed for mobile and embedded vision applications. The architecture relies on depth-wise separable convolutions and introduces two global hyperparameters (width multiplier and resolution multiplier) that allow trading off between model accuracy and computational efficiency (latency and size). The authors provide extensive empirical evaluation on ImageNet classification and show the models' utility in several downstream tasks such as object detection, fine-grained classification, face attribute classification, and geo-localization.

Significance. If the central claims hold, this work has high significance for the field of efficient deep learning. It demonstrates that a simple architectural choice combined with straightforward scaling rules can produce models that achieve good accuracy-latency trade-offs across a range of vision applications. The transparent presentation of results on held-out data and the applicability to multiple tasks without per-task redesign are strengths that could influence subsequent research on mobile-optimized networks.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation to accept the manuscript. No major comments were provided for us to address.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The MobileNets architecture is defined directly via depthwise separable convolutions (a pre-existing factorization) plus two explicit user-selectable scalar multipliers for width and resolution. All reported results consist of measured top-1 accuracy, multiply-add counts, and latency on held-out ImageNet validation plus transfer tasks; the multipliers are not fitted inside the reported experiments but chosen by the model builder. No equation or claim reduces by construction to its own inputs, no uniqueness theorem is invoked, and no self-citation chain carries the central empirical demonstration.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard CNN training assumptions and the empirical observation that depthwise separable convolutions are a good efficiency-accuracy trade-off. No new physical entities or unproven mathematical axioms are introduced.

free parameters (2)
  • width multiplier alpha
    Global scalar that uniformly reduces channel counts in every layer; chosen by the model builder to meet latency targets.
  • resolution multiplier rho
    Scalar that reduces input image resolution; chosen by the model builder.
axioms (2)
  • domain assumption Depthwise separable convolutions preserve sufficient feature quality for the target vision tasks when applied uniformly across layers.
    Invoked in section 3 when defining the MobileNet block and when claiming the architecture remains effective after scaling.
  • domain assumption Standard ImageNet training (SGD, data augmentation, etc.) produces representative accuracy numbers for mobile deployment.
    Used throughout the experimental section without additional justification.
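Both ledger entries act on the same layer cost model; a minimal sketch (symbols as in the paper, with alpha scaling channel counts and rho scaling input resolution; the function name is illustrative) shows how each shrinks a layer's multiply-adds:

```python
def scaled_layer_cost(k, m, n, f, alpha=1.0, rho=1.0):
    """Multiply-adds of one depth-wise separable layer after applying the
    width multiplier alpha (channels) and resolution multiplier rho (input size).

    Cost = k^2 * (alpha*m) * (rho*f)^2  +  (alpha*m) * (alpha*n) * (rho*f)^2
    """
    am, an, rf = alpha * m, alpha * n, rho * f
    return k * k * am * rf * rf + am * an * rf * rf

base = scaled_layer_cost(3, 512, 512, 14)
half_width = scaled_layer_cost(3, 512, 512, 14, alpha=0.5)
half_res = scaled_layer_cost(3, 512, 512, 14, rho=0.5)

# Width scaling cuts cost roughly quadratically (the point-wise term dominates);
# resolution scaling cuts cost by exactly rho^2.
print(half_width / base, half_res / base)
```

This is why the paper treats both multipliers as roughly quadratic knobs on compute: halving either brings this layer to about a quarter of its cost.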

pith-pipeline@v0.9.0 · 5436 in / 1479 out tokens · 29210 ms · 2026-05-11T02:45:24.717356+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy.

  • Foundation.DimensionForcing dimension_forced unclear

    We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification.

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

    cs.CV 2026-04 unverdicted novelty 7.0

    KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

  2. Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation

    quant-ph 2026-04 unverdicted novelty 7.0

    Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.

  3. Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

    cs.CV 2026-04 accept novelty 7.0

    A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.

  4. Searching for Activation Functions

    cs.NE 2017-10 conditional novelty 7.0

    Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

  5. TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

    cs.CV 2026-05 unverdicted novelty 6.0

    TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS...

  6. GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution

    cs.CV 2026-05 unverdicted novelty 6.0

    GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 202...

  7. Hardware-Aware Neural Feature Extraction for Resource-Constrained Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    Gideon is a hardware-aware feature extractor using distillation and DNAS that achieves 111 fps on STM32N6 under 1.5 MB memory with negligible INT8 quantization loss.

  8. EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures

    cs.NE 2026-04 unverdicted novelty 6.0

    EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.

  9. Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Unified and Generalized Approach

    cs.CV 2026-04 unverdicted novelty 6.0

    Blind omnidirectional image quality assessment reduces to standard 2D blind IQA by skipping viewport generation, yielding a unified model that accepts equirectangular inputs directly.

  10. H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

    cs.CV 2026-04 unverdicted novelty 6.0

    H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.

  11. Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition

    cs.AR 2026-04 unverdicted novelty 6.0

    A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.

  12. DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery

    cs.CV 2026-04 unverdicted novelty 6.0

    DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4....

  13. CODO: An Automated Compiler for Comprehensive Dataflow Optimization

    cs.AR 2026-04 unverdicted novelty 6.0

    CODO automates comprehensive dataflow optimization on FPGAs, achieving 1.45x-4.52x speedups on kernels and up to 33.8x on DNN models over state-of-the-art frameworks.

  14. YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks

    cs.SD 2026-04 conditional novelty 6.0

    YMIR dataset and YMCM CNN achieve 98.8% accuracy classifying five Yemeni music genres from audio features.

  15. FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

    cs.LG 2026-04 unverdicted novelty 6.0

    FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...

  16. TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification

    cs.AR 2026-05 unverdicted novelty 5.0

    TREA is a low-precision time-multiplexed edge accelerator using dual-precision SIMD MAC units, structured pruning, and reconfigurable activation cores to deliver up to 9x kernel-level latency reduction for object dete...

  17. Lightning Unified Video Editing via In-Context Sparse Attention

    cs.CV 2026-05 unverdicted novelty 5.0

    ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

  18. Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions

    eess.SP 2026-05 conditional novelty 5.0

    Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...

  19. Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera

    cs.CV 2026-04 unverdicted novelty 5.0

    A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.

  20. DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.

  21. Towards Topology-Aware Very Large-Scale Photonic AI Accelerators

    cs.AR 2026-04 unverdicted novelty 5.0

    Photonic accelerators hit a topology-driven Utilization Wall; symmetric grids improve utilization up to 6X and cut memory access over 40% versus linear layouts.

  22. Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

    cs.CV 2026-04 unverdicted novelty 5.0

    A lightweight LLIE framework pairs frozen distribution-normalizing preprocessing with a compact depthwise U-Net to deliver competitive perceptual quality using far fewer parameters than prior methods.

  23. Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria

    cs.LG 2026-04 unverdicted novelty 5.0

    Arbitrary heterogeneous fan-in profiles in sparse networks match uniform random accuracy at high sparsity, but initializing RigL dynamic sparse training with equilibrium-matched lognormal profiles improves performance...

  24. End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables

    cs.LG 2026-04 unverdicted novelty 5.0

    An end-to-end hardware-aware optimization pipeline produces DNNs for PPG-based blood pressure estimation with up to 7.99% lower error and 83x fewer parameters that fit on ultra-low-power SoCs like GAP8.

  25. AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

    cs.CV 2026-04 unverdicted novelty 5.0

    AsymLoc uses teacher-student distillation with geometry-driven matching to enable efficient nearest-neighbor feature matching, achieving 95% of teacher accuracy with 10x smaller models on localization benchmarks.

  26. DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network

    eess.AS 2026-04 unverdicted novelty 5.0

    DAT-CFTNet combines attention-based dual-path RNN with CFTNet to improve speech enhancement and intelligibility for cochlear implant recipients in noisy conditions.

  27. YOLOv4: Optimal Speed and Accuracy of Object Detection

    cs.CV 2020-04 unverdicted novelty 5.0

    YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.

  28. Smart Railway Obstruction Detection System using IoT and Computer Vision

    cs.CV 2026-05 unverdicted novelty 4.0

    NETRA integrates PIR and ultrasonic sensors with edge AI on Raspberry Pi to achieve 95% intrusion detection accuracy and zero false alarms at 75% lower cost than existing optical fiber systems.

  29. Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

    cs.CV 2026-05 unverdicted novelty 4.0

    A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.

  30. Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation

    cs.CV 2026-05 unverdicted novelty 4.0

    A lightweight hybrid CNN-Transformer framework for heterogeneous face recognition achieves competitive performance on cross-spectral benchmarks and standard RGB tasks using contrastive alignment and distillation.

  31. FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    FoR-Net improves efficiency in semantic segmentation by focusing on hard regions with a learned selector and multi-scale convolutions, achieving competitive results on Cityscapes.

  32. EdgeLPR: On the Deep Neural Network trade-off between Precision and Performance in LiDAR Place Recognition

    cs.CV 2026-05 unverdicted novelty 4.0

    FP16 quantization preserves accuracy in BEV-based LiDAR place recognition at lower cost while INT8 degradation depends on the network architecture.

  33. A Light Weight Multi-Features-View Convolution Neural Network For Plant Disease Identification

    cs.CV 2026-04 unverdicted novelty 4.0

    A lightweight multi-features-view CNN achieves 2.9% higher accuracy than a standard RGB CNN on the PlantVillage dataset while remaining less computationally expensive.

  34. CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

    cs.DC 2026-04 unverdicted novelty 4.0

    Warp-tiled CUDA kernel for depthwise convolution delivers 3.26x runtime reduction versus naive baseline and 1.29x end-to-end training speedup using counter-free analysis in cloud settings.

  35. TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing

    cs.AR 2026-04 unverdicted novelty 4.0

    TetrisG-SDK delivers 1.2-1.3x speedups and up to 70% EDAP reduction on CNN models by using adaptive windows for multi-macro parallelism and grouped convolutions on CIM hardware.

  36. Knowledge Distillation for Lightweight Multimodal Sensing-Aided mmWave Beam Tracking

    eess.SP 2026-04 conditional novelty 4.0

    Knowledge distillation creates a lightweight student model that reaches over 96% top-5 beam prediction accuracy on real multimodal sensor data while using 27 times fewer parameters than the teacher.

  37. Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks

    cs.LG 2026-04 unverdicted novelty 4.0

    Adaptive Data Dropout uses performance feedback to dynamically modulate training data exposure, reducing effective steps while matching static dropout accuracy on image benchmarks.

  38. A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study

    cs.CV 2026-04 unverdicted novelty 4.0

    PD36-C is a 1.25 million parameter CNN achieving 99.53% average test accuracy on 38 plant disease classes from the New Plant Diseases Dataset, with a Qt-based app enabling edge deployment.

  39. FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

    cs.CV 2026-04 unverdicted novelty 4.0

    FaceLiVTv2 improves the accuracy-efficiency trade-off for mobile face recognition, cutting inference latency by 22% versus its predecessor while outperforming other lightweight models on standard benchmarks.

  40. Digital Image Forgery Detection Using Transfer Learning

    cs.CV 2026-05 unverdicted novelty 3.0

    A hybrid RGB plus compression-feature transfer learning pipeline with Youden-optimized thresholds improves forgery detection on the CASIA v2.0 dataset using off-the-shelf CNN backbones.

  41. Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification

    cs.CV 2026-05 unverdicted novelty 3.0

    A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.

  42. Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

    cs.CV 2026-04 unverdicted novelty 3.0

    RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.

  43. Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles

    cs.CV 2026-04 conditional novelty 3.0

    A threshold-based lane detector with perspective warp and histogram curvature plus EfficientNet-B0 achieves 3.16% max lane offset RMSE and 90% on-device sign accuracy while running real-time on resource-limited hardware.

  44. Real-Time Cellist Postural Evaluation With On-Device Computer Vision

    cs.HC 2026-04 unverdicted novelty 3.0

    Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.

  45. MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering

    cs.CV 2026-04 unverdicted novelty 3.0

    MS-SSE-Net integrates multi-scale feature extraction and squeeze-and-excitation attention into DenseNet201, reaching 99.26% accuracy on the StructDamage dataset and outperforming the baseline by about 0.73 percentage points.

  46. A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

    cs.CV 2026-05 unverdicted novelty 2.0

    Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.

  47. NTIRE 2026 Challenge on Efficient Low Light Image Enhancement: Methods and Results

    cs.CV 2026-05 unverdicted novelty 2.0

    The NTIRE 2026 E-LLIE challenge evaluated 27 lightweight models for low-light image enhancement and reported advances in balancing quality with mobile efficiency.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 47 Pith papers · 5 internal anchors

  1. [1]

    Abadi, A

    M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow. org , 1,

  2. [2]

    W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y . Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015. 2

  3. [3]

    F. Chollet. Xception: Deep learning with depthwise separa- ble convolutions. arXiv preprint arXiv:1610.02357v2, 2016. 1

  4. [4]

    Courbariaux, J.-P

    M. Courbariaux, J.-P. David, and Y . Bengio. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014. 2

  5. [5]

    S. Han, H. Mao, and W. J. Dally. Deep compression: Com- pressing deep neural network with pruning, trained quantiza- tion and huffman coding. CoRR, abs/1510.00149, 2, 2015. 2

  6. [6]

    Hays and A

    J. Hays and A. Efros. IM2GPS: estimating geographic in- formation from a single image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008. 7

  7. [7]

    Hays and A

    J. Hays and A. Efros. Large-Scale Image Geolocalization. In J. Choi and G. Friedland, editors, Multimodal Location Estimation of Videos and Images. Springer, 2014. 6, 7

  8. [8]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- ing for image recognition. arXiv preprint arXiv:1512.03385,

  9. [9]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 2, 7

  10. [10]

    Huang, V

    J. Huang, V . Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y . Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 7

  11. [11]

    Hubara, M

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y . Bengio. Quantized neural networks: Training neural net- works with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016. 2

  12. [12]

    F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and¡ 1mb model size. arXiv preprint arXiv:1602.07360, 2016. 1, 6

  13. [13]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 1, 3, 7

  14. [14]

    Speeding up convo- lutional neural networks with low rank expansions,

    M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014. 2

  15. [15]

    Y . Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- tional architecture for fast feature embedding.arXiv preprint arXiv:1408.5093, 2014. 4

  16. [16]

    J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014. 1, 3

  17. [17]

    Khosla, N

    A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition , Colorado Springs, CO, June 2011. 6

  18. [18]

    Krause, B

    J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable ef- fectiveness of noisy data for fine-grained recognition. arXiv preprint arXiv:1511.06789, 2015. 6

  19. [19]

    Krizhevsky, I

    A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012. 1, 6

  20. [20]

    Lebedev, Y

    V . Lebedev, Y . Ganin, M. Rakhuba, I. Oseledets, and V . Lempitsky. Speeding-up convolutional neural net- works using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014. 2

  21. [21]

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. arXiv preprint arXiv:1512.02325, 2015. 7

  22. [22]

    Rastegari, V

    M. Rastegari, V . Ordonez, J. Redmon, and A. Farhadi. Xnor- net: Imagenet classification using binary convolutional neu- ral networks. arXiv preprint arXiv:1603.05279, 2016. 1, 2

  23. [23]

    S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems , pages 91–99, 2015. 7

  24. [24]

    Russakovsky, J

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision , 115(3):211–252,

  25. [25]

    Schroff, D

    F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni- fied embedding for face recognition and clustering. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. 8

  26. [26]

    L. Sifre. Rigid-motion scattering for image classification . PhD thesis, Ph. D. thesis, 2014. 1, 3

  27. [27]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 1, 6

  28. [28]

    Sindhwani, T

    V . Sindhwani, T. Sainath, and S. Kumar. Structured trans- forms for small-footprint deep learning. In Advances in Neural Information Processing Systems , pages 3088–3096,

  29. [29]

    Inception-v4, inception- resnet and the impact of residual connections on learning

    C. Szegedy, S. Ioffe, and V . Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 1

  30. [30]

    Szegedy, W

    C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 1–9, 2015. 6

  31. [31]

    Rethinking the inception architecture for computer vision

    C. Szegedy, V . Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015. 1, 3, 4, 7

  32. [32]

    Thomee, D

    B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. Communications of the ACM , 59(2):64–73, 2016. 7

  33. [33]

    Tieleman and G

    T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning , 4(2),

  34. [34]

    M. Wang, B. Liu, and H. Foroosh. Factorized convolutional neural networks. arXiv preprint arXiv:1608.04337, 2016. 1

  35. [35]

    Weyand, I

    T. Weyand, I. Kostrikov, and J. Philbin. PlaNet - Photo Ge- olocation with Convolutional Neural Networks. InEuropean Conference on Computer Vision (ECCV), 2016. 6, 7

  36. [36]

    J. Wu, C. Leng, Y . Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. arXiv preprint arXiv:1512.06473, 2015. 1

  37. [37]

    Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision , pages 1476–1483, 2015. 1