MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Pith reviewed 2026-05-11 02:45 UTC · model grok-4.3
The pith
MobileNets use depth-wise separable convolutions and two global scaling hyperparameters to build lightweight networks that trade latency for accuracy on mobile devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileNets are built from depth-wise separable convolutions that split each standard convolution into a per-channel spatial filter followed by a 1x1 point-wise combination. With 3x3 kernels this factorization needs roughly 8 to 9 times fewer multiply-adds than a standard convolution. The architecture adds two global hyperparameters: a width multiplier that uniformly reduces the number of channels across layers, and a resolution multiplier that shrinks the input image size. These parameters let a single base design generate a range of models that match the latency budgets of different mobile hardware while keeping enough capacity for high accuracy on vision tasks.
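The saving from the factorization can be checked with a short calculation. The sketch below counts multiply-adds for one layer under simplifying assumptions (stride 1, output feature map the same size as the input); the function names and the 512-channel, 14x14 example layer are illustrative choices, not values taken from this page.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds for a standard convolution (stride 1, same-size output)."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds for a depth-wise conv followed by a 1x1 point-wise conv."""
    depthwise = kernel * kernel * in_ch * fmap * fmap  # one spatial filter per input channel
    pointwise = in_ch * out_ch * fmap * fmap           # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
std = standard_conv_mult_adds(3, 512, 512, 14)
sep = separable_conv_mult_adds(3, 512, 512, 14)

# The ratio simplifies to 1/out_ch + 1/kernel**2, independent of the map size.
ratio = sep / std
print(f"standard: {std:,}  separable: {sep:,}  ratio: {ratio:.4f}")
```

Because the ratio is 1/out_ch + 1/kernel², the 1/kernel² term dominates for wide layers, which is why 3x3 kernels give a saving of a little under 9x.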
What carries the argument
Depth-wise separable convolutions that separate spatial and channel operations, combined with uniform width and resolution multipliers for global scaling.
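To make the separation of spatial and channel operations concrete, here is a minimal pure-Python sketch of the two steps. It assumes stride 1 and no padding, and it omits the batch normalization and ReLU that the paper applies after each step. The function names and the nested-list tensor layout are illustrative assumptions, not the authors' implementation.

```python
def depthwise_conv(x, dw):
    """x: [C][H][W] input; dw: [C][K][K], one spatial filter per channel."""
    k = len(dw[0])
    h_out = len(x[0]) - k + 1
    w_out = len(x[0][0]) - k + 1
    return [
        [
            [
                sum(x[c][i + di][j + dj] * dw[c][di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(w_out)
            ]
            for i in range(h_out)
        ]
        for c in range(len(x))
    ]


def pointwise_conv(x, pw):
    """x: [C][H][W]; pw: [N][C], weights of a 1x1 conv with N output channels."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(pw[n][c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
            for i in range(h)
        ]
        for n in range(len(pw))
    ]


def depthwise_separable_conv(x, dw, pw):
    """Spatial filtering per channel, then channel mixing at each position."""
    return pointwise_conv(depthwise_conv(x, dw), pw)


# Toy input: two 3x3 channels (all ones, all twos), 2x2 box filters,
# and a point-wise step that outputs the channel sum and difference.
x = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
dw = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
pw = [[1, 1], [1, -1]]
y = depthwise_separable_conv(x, dw, pw)
print(y)
```

The depth-wise step never mixes channels and the point-wise step never looks at neighbouring pixels, which is the split the section describes.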
If this is right
- A single architecture family can generate models sized to fit specific hardware limits and accuracy needs.
- The networks achieve strong accuracy-latency tradeoffs on ImageNet classification relative to other common models.
- The same models transfer effectively to object detection, fine-grained classification, face attribute prediction, and large-scale geo-localization.
- Model builders can select the right size using only the two global hyperparameters instead of redesigning the network.
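How the two global hyperparameters shape a model family can be sketched by scaling the cost of a single depth-wise separable layer. In the sketch below the width multiplier `alpha` scales both channel counts and the resolution multiplier `rho` scales the feature-map side. The function name, the rounding rule, and the example layer are assumptions for illustration; the grid of values is hypothetical rather than the paper's exact settings.

```python
def scaled_mult_adds(kernel, in_ch, out_ch, fmap, alpha=1.0, rho=1.0):
    """Multiply-adds of one depth-wise separable layer after global scaling."""
    m = round(alpha * in_ch)   # thinned input channels
    n = round(alpha * out_ch)  # thinned output channels
    f = round(rho * fmap)      # reduced feature-map side
    return kernel * kernel * m * f * f + m * n * f * f


# Hypothetical base layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
base = scaled_mult_adds(3, 512, 512, 14)

# Sweep an illustrative grid of multipliers and report relative cost.
for alpha in (1.0, 0.75, 0.5, 0.25):
    for rho in (1.0, 0.5):
        cost = scaled_mult_adds(3, 512, 512, 14, alpha, rho)
        print(f"alpha={alpha:<4} rho={rho:<3} cost={cost:>11,} ({cost / base:.3f} of base)")

half_half = scaled_mult_adds(3, 512, 512, 14, alpha=0.5, rho=0.5)
```

Cost falls roughly with alpha squared (the point-wise term dominates) and with rho squared, so halving both gives a layer at about one sixteenth of the base cost.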
Where Pith is reading between the lines
- The uniform scaling approach could be tested on other convolution-based architectures to see if similar efficiency gains appear without full redesign.
- Real-time vision pipelines on edge hardware become more practical when accuracy can be dialed to match available compute.
- Combining these separable blocks with input-dependent scaling might further reduce average latency on varied data.
Load-bearing premise
Depth-wise separable convolutions plus uniform width and resolution scaling preserve enough representational power across the full range of target tasks and hardware constraints without needing task-specific redesign.
What would settle it
The claim that the scaling method works across constraints would be disproved if a MobileNet variant with a chosen width and resolution multiplier fell well below its predicted ImageNet accuracy, or failed to produce usable object-detection results on a mobile device.
Original abstract
We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MobileNets, a family of lightweight convolutional neural networks designed for mobile and embedded vision applications. The architecture relies on depth-wise separable convolutions and introduces two global hyperparameters (width multiplier and resolution multiplier) that allow trading off between model accuracy and computational efficiency (latency and size). The authors provide extensive empirical evaluation on ImageNet classification and show the models' utility in several downstream tasks such as object detection, fine-grained classification, face attribute classification, and geo-localization.
Significance. If the central claims hold, this work has high significance for the field of efficient deep learning. It demonstrates that a simple architectural choice combined with straightforward scaling rules can produce models that achieve good accuracy-latency trade-offs across a range of vision applications. The transparent presentation of results on held-out data and the applicability to multiple tasks without per-task redesign are strengths that could influence subsequent research on mobile-optimized networks.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the work's significance, and recommendation to accept the manuscript. No major comments were provided for us to address.
Circularity Check
No significant circularity identified
The MobileNets architecture is defined directly via depthwise separable convolutions (a pre-existing factorization) plus two explicit user-selectable scalar multipliers for width and resolution. All reported results consist of measured top-1 accuracy, multiply-add counts, and latency on held-out ImageNet validation plus transfer tasks; the multipliers are not fitted inside the reported experiments but chosen by the model builder. No equation or claim reduces by construction to its own inputs, no uniqueness theorem is invoked, and no self-citation chain carries the central empirical demonstration.
Axiom & Free-Parameter Ledger
free parameters (2)
- width multiplier alpha
- resolution multiplier rho
axioms (2)
- domain assumption: Depthwise separable convolutions preserve sufficient feature quality for the target vision tasks when applied uniformly across layers.
- domain assumption: Standard ImageNet training (SGD, data augmentation, etc.) produces representative accuracy numbers for mobile deployment.
Lean theorems connected to this paper
- Cost.FunctionalEquation, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy.
- Foundation.DimensionForcing, dimension_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.
-
LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models
LAION-C supplies six novel corruptions that stay OOD for web-scale training sets and demonstrates that leading models now rival or exceed human robustness on them.
-
VMamba: Visual State Space Model
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
-
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
-
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration
DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
-
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
-
Elastic Spiking Transformers for Efficient Gesture Understanding
A single Elastic Spiking Transformer model dynamically slices network width and attention heads at runtime via granularity-aware weight sharing, matching or exceeding fixed baselines on CIFAR and gesture datasets whil...
-
KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
-
Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.
-
Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.
-
MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
MobileMold provides 4941 smartphone microscopy images and shows deep learning models reach 99.5% accuracy on mold detection and food classification tasks.
-
SocialPulse: On-Device Detection of Social Interactions in Naturalistic Settings Using Smartwatch Multimodal Sensing
SocialPulse presents an on-device smartwatch system for detecting diverse social interactions in naturalistic settings, achieving 77.28% self-report confirmation in a 38-person 900-hour deployment and 90.39% accuracy ...
-
DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction
DuFal combines global and local high-frequency Fourier neural operators with cross-attention fusion to recover fine anatomical structures in extremely sparse-view CBCT, outperforming prior methods on LUNA16 and ToothF...
-
DISK: Differentiable Sparse Kernel Complex for Efficient Spatially-Variant Convolution
DISK is a differentiable sparse kernel decomposition method that approximates spatially-variant complex convolutions using optimized sparse samples, initialization for non-convex shapes, and interpolation, achieving h...
-
FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry
FractalMamba++ scales Vision Mamba across resolutions by using Hilbert fractal serialization, hierarchy-based skip connections, and fractal-aware 2D rotary position encoding.
-
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.
-
Searching for Activation Functions
Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
-
VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection
VACE learns compact directionally coherent representations for multivariate time series anomaly detection via velocity-consistency training and reports state-of-the-art results on TSB-AD-M.
-
Low Latency Gaze Tracking via Latent Optical Sensing
A hardware prototype performs gaze estimation by optically encoding task-relevant features with a microlens array and mask, captured on a 4x4 phototransistor array and decoded by a small neural network, reaching 3.4 m...
-
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS...
-
GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution
GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 202...
-
Hardware-Aware Neural Feature Extraction for Resource-Constrained Devices
Gideon is a hardware-aware feature extractor using distillation and DNAS that achieves 111 fps on STM32N6 under 1.5 MB memory with negligible INT8 quantization loss.
-
EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures
EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.
-
Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Unified and Generalized Approach
Blind omnidirectional image quality assessment reduces to standard 2D blind IQA by skipping viewport generation, yielding a unified model that accepts equirectangular inputs directly.
-
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
-
Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition
A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.
-
DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4....
-
CODO: An Automated Compiler for Comprehensive Dataflow Optimization
CODO automates comprehensive dataflow optimization on FPGAs, achieving 1.45x-4.52x speedups on kernels and up to 33.8x on DNN models over state-of-the-art frameworks.
-
YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks
YMIR dataset and YMCM CNN achieve 98.8% accuracy classifying five Yemeni music genres from audio features.
-
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
-
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
-
AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers
AHC applies meta-learned hierarchical compression with dual memory banks to enable continual object detection on MCUs under a 100KB budget, backed by a forgetting bound of O(ε√T + 1/√M) and competitive results on CORe...
-
On-chip probabilistic inference for charged-particle tracking at the sensor edge
Neural networks integrated into silicon sensor front-end electronics can regress charged-particle hit positions and angles with calibrated uncertainties from single-layer data while satisfying hardware constraints on ...
-
Low Cost, High Efficiency: LiDAR Place Recognition in Vineyards with Matryoshka Representation Learning
MinkUNeXt-VINE applies Matryoshka Representation Learning to achieve efficient, high-performing place recognition from sparse LiDAR in vineyards, beating state-of-the-art on two real long-term datasets.
-
Versatile yet Efficient Network Traffic Analysis: Offloading Network Foundation Model to SmartNIC
Nepco offloads network foundation models to SmartNICs using localized byte-sequence modeling and a pattern-aware convolutional architecture to achieve competitive macro F1 scores with 328x lower end-to-end latency tha...
-
Variational Autoencoder-Based Black-Box Adversarial Attack on Collaborative DNN Inference
AdVAR-DNN employs a variational autoencoder to create untraceable adversarial samples that compromise black-box collaborative DNN inference by exploiting model partitioning information exchange, achieving high misclas...
-
Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products
Proposes ACH module with differentiable sampling and softsign normalization for efficient feature expansion, integrated via NAS into Hadaptive-Net to claim SOTA accuracy/speed trade-offs on image classification.
-
ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network
An 8K super-resolution accelerator using edge-selective dynamic processing achieves 30 FPS with 50% fewer MAC operations and 84% smaller model while keeping PSNR loss under 0.6 dB.
-
Deep Privacy Funnel Model: From a Discriminative to a Generative Approach with an Application to Face Recognition
Introduces Generative Privacy Funnel (GenPF) and deep variational PF (DVPF) models that extend the privacy funnel to generative settings and provide a controllable privacy-utility trade-off with reduced sensitive attr...
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
RECALL: Rehearsal-free Continual Learning for Object Classification
RECALL achieves rehearsal-free continual learning for object classification by logit recall before new training, regression regularization, Mahalanobis loss on known categories, and new heads per sequence, outperformi...
-
BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
BEVDet achieves 39.3% mAP and 47.2% NDS on nuScenes val set with a fast BEV-based multi-camera 3D detector that outperforms FCOS3D while using far less compute in its tiny variant.
-
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
MobileViT is a lightweight vision transformer that reports 78.4% top-1 accuracy on ImageNet-1k with ~6M parameters, outperforming MobileNetv3 by 3.2% and DeIT by 6.2% at similar size, plus gains on MS-COCO detection.
-
Co-Evolutionary Compression for Unpaired Image Translation
A co-evolutionary compression technique reduces parameters and FLOPs in unpaired image-to-image translation GAN generators while maintaining translation quality on benchmarks.
-
DeepOrganNet: On-the-Fly Reconstruction and Visualization of 3D / 4D Lung Models from Single-View Projections by Deep Deformation Network
DeepOrganNet reconstructs 3D/4D lung meshes from single-view 2D projections by learning smooth deformation fields from multiple templates via a deep network and trivariate tensor-product deformation.
-
Open DNN Box by Power Side-Channel Attack
Power side-channel analysis recovers DNN architecture and parameters at 96.5% average accuracy on real embedded devices.
-
Separable Convolutional LSTMs for Faster Video Segmentation
Separable convLSTMs cut parameters and FLOPs in video segmentation, delivering up to 15% faster GPU inference with similar or slightly lower accuracy.
-
A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
A unified IR plus ML-based scheduling for CNN inference on multi-vendor integrated GPUs matches or exceeds vendor libraries (up to 1.62x) on image models while supporting more models.
-
COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning
COP prunes CNN filters using correlation-based importance with global normalization and dual regularization on parameter quantity and FLOPs to enable customized compression.
-
Transferable 3D Convolutional Neural Networks for Elastic Constants Prediction in Nanoporous Metals
3D CNNs predict elastic moduli of nanoporous metals with R²=0.955, outperforming descriptor-based models, and transfer learning works on smaller denser datasets for large-scale Pareto optimization.
-
Consistently Informative Soft-Label Temperature for Knowledge Distillation
CIST uses per-sample adaptive temperatures for both teacher and student in knowledge distillation to ensure consistent entropy in soft labels and reports gains on vision and language tasks.
-
Personalized Face Privacy Protection From a Single Image
FaceCloak learns a lightweight identity-specific cloaking mask from a single image via synthetic face generation and iterative embedding perturbation to evade multiple recognition models.
-
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
-
Are Candidate Models Really Needed for Active Learning?
Active learning with randomly initialized models achieves comparable results to traditional candidate-model methods, with low-confidence sampling proving most effective.
-
TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification
TREA is a low-precision time-multiplexed edge accelerator using dual-precision SIMD MAC units, structured pruning, and reconfigurable activation cores to deliver up to 9x kernel-level latency reduction for object dete...
-
LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
-
Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions
Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...
-
Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera
A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.
-
DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.
-
Towards Topology-Aware Very Large-Scale Photonic AI Accelerators
Photonic accelerators hit a topology-driven Utilization Wall; symmetric grids improve utilization up to 6X and cut memory access over 40% versus linear layouts.