MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G. Howard; Bo Chen; Dmitry Kalenichenko; Hartwig Adam; Marco Andreetto; Menglong Zhu; Tobias Weyand; Weijun Wang

arxiv: 1704.04861 · v1 · submitted 2017-04-17 · 💻 cs.CV

Pith reviewed 2026-05-11 02:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords MobileNets · depth-wise separable convolutions · efficient convolutional networks · mobile vision applications · width and resolution multipliers · ImageNet classification · model scaling


The pith

MobileNets use depth-wise separable convolutions and two global scaling hyperparameters to build lightweight networks that trade latency for accuracy on mobile devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MobileNets as a family of efficient convolutional neural networks designed specifically for mobile and embedded vision applications. These models replace standard convolutions with depth-wise separable operations that factor spatial filtering from channel mixing to reduce computation and parameters. Two simple hyperparameters, one for width and one for resolution, allow uniform scaling of the entire network to hit different speed and accuracy targets. Extensive tests on ImageNet show competitive accuracy for the given resources, and the same models transfer to object detection, fine-grained classification, face attributes, and geo-localization without custom redesign.

Core claim

MobileNets are built from depth-wise separable convolutions that split each standard convolution into a per-channel spatial filter followed by a 1x1 point-wise combination, producing far fewer operations. The architecture adds two global hyperparameters: a width multiplier that uniformly reduces the number of channels across layers, and a resolution multiplier that shrinks the input image size. These parameters let a single base design generate a range of models that match the latency budgets of different mobile hardware while keeping enough capacity for high accuracy on vision tasks.
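
As a minimal sketch of the factorization described above, the following NumPy code implements a depth-wise separable convolution as a per-channel spatial filter followed by a 1x1 point-wise combination. The function names, the (height, width, channels) layout, stride 1, and the absence of padding are assumptions made for illustration; the batch normalization and ReLU that the paper applies after each layer are left out.

```python
import numpy as np

def depthwise_conv(x, dw_kernels):
    """Filter each input channel with its own K x K kernel.

    x: array of shape (H, W, M); dw_kernels: array of shape (K, K, M).
    Stride 1, no padding (illustrative assumptions).
    """
    H, W, M = x.shape
    K = dw_kernels.shape[0]
    out = np.zeros((H - K + 1, W - K + 1, M))
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            patch = x[i:i + K, j:j + K, :]            # (K, K, M)
            # Sum over the spatial window only; channels stay separate.
            out[i, j, :] = np.sum(patch * dw_kernels, axis=(0, 1))
    return out

def pointwise_conv(x, pw_weights):
    """1x1 convolution: mix M channels into N at every spatial position.

    x: array of shape (H, W, M); pw_weights: array of shape (M, N).
    """
    return x @ pw_weights

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Spatial filtering per channel, then channel mixing."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)
```

The result equals a standard convolution whose K x K x M x N kernel is constrained to the product of the depth-wise and point-wise weights, which is where the savings in parameters and operations come from.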

What carries the argument

Depth-wise separable convolutions that separate spatial and channel operations, combined with uniform width and resolution multipliers for global scaling.
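
The size of the saving can be written out with the paper's cost model, where D_K is the kernel size, M and N are the input and output channel counts, and D_F is the feature-map size. The block below restates those multiply-add counts and their ratio; it assumes stride 1 and square kernels and feature maps.

```latex
\begin{align*}
\text{standard convolution:} \quad & D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \\
\text{depth-wise separable:} \quad & D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \\
\text{ratio:} \quad & \frac{D_K^2 \, M \, D_F^2 + M \, N \, D_F^2}{D_K^2 \, M \, N \, D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}
\end{align*}
```

For 3x3 kernels and a large N the ratio approaches 1/9, which matches the 8 to 9 times reduction in computation that the paper reports.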

If this is right

A single architecture family can generate models sized to fit specific hardware limits and accuracy needs.
The networks achieve strong accuracy-latency tradeoffs on ImageNet classification relative to other common models.
The same models transfer effectively to object detection, fine-grained classification, face attribute prediction, and large-scale geo-localization.
Model builders can select the right size using only the two global hyperparameters instead of redesigning the network.
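
To illustrate how selection by the two hyperparameters might look in practice, here is a rough sketch that enumerates the width multipliers and input resolutions evaluated in the paper and picks the most expensive configuration that fits a compute budget. The function names and the budget expressed as a fraction of the full model are hypothetical, the cost is approximated as (alpha * rho)^2, and treating more compute as a proxy for accuracy is an assumption; a real choice would compare measured accuracy and latency.

```python
from itertools import product

# Grid of settings evaluated in the paper.
WIDTHS = (1.0, 0.75, 0.5, 0.25)
RESOLUTIONS = (224, 192, 160, 128)

def relative_cost(alpha, resolution, base_resolution=224):
    """Approximate compute relative to the full model (alpha=1, 224 input)."""
    rho = resolution / base_resolution
    return (alpha * rho) ** 2

def pick_config(budget_fraction):
    """Return (alpha, resolution) with the highest approximate cost that
    still fits within budget_fraction, or None if nothing fits."""
    feasible = [
        (relative_cost(a, r), a, r)
        for a, r in product(WIDTHS, RESOLUTIONS)
        if relative_cost(a, r) <= budget_fraction
    ]
    if not feasible:
        return None
    _, alpha, resolution = max(feasible)
    return alpha, resolution
```

A call such as pick_config(0.3) asks for a model using at most about 30% of the full model's multiply-adds under this approximation.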

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform scaling approach could be tested on other convolution-based architectures to see if similar efficiency gains appear without full redesign.
Real-time vision pipelines on edge hardware become more practical when accuracy can be dialed to match available compute.
Combining these separable blocks with input-dependent scaling might further reduce average latency on varied data.

Load-bearing premise

Depth-wise separable convolutions plus uniform width and resolution scaling preserve enough representational power across the full range of target tasks and hardware constraints without needing task-specific redesign.

What would settle it

The claim that the scaling method works across constraints would be disproved if a MobileNet variant at a chosen width and resolution multiplier fell well below its predicted ImageNet accuracy, or failed to produce usable object detection results on a mobile device.

Original abstract

We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces MobileNets, a family of lightweight convolutional neural networks designed for mobile and embedded vision applications. The architecture relies on depth-wise separable convolutions and introduces two global hyperparameters (width multiplier and resolution multiplier) that allow trading off between model accuracy and computational efficiency (latency and size). The authors provide extensive empirical evaluation on ImageNet classification and show the models' utility in several downstream tasks such as object detection, fine-grained classification, face attribute classification, and geo-localization.

Significance. If the central claims hold, this work has high significance for the field of efficient deep learning. It demonstrates that a simple architectural choice combined with straightforward scaling rules can produce models that achieve good accuracy-latency trade-offs across a range of vision applications. The transparent presentation of results on held-out data and the applicability to multiple tasks without per-task redesign are strengths that could influence subsequent research on mobile-optimized networks.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation to accept the manuscript. No major comments were provided for us to address.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The MobileNets architecture is defined directly via depthwise separable convolutions (a pre-existing factorization) plus two explicit user-selectable scalar multipliers for width and resolution. All reported results consist of measured top-1 accuracy, multiply-add counts, and latency on held-out ImageNet validation plus transfer tasks; the multipliers are not fitted inside the reported experiments but chosen by the model builder. No equation or claim reduces by construction to its own inputs, no uniqueness theorem is invoked, and no self-citation chain carries the central empirical demonstration.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard CNN training assumptions and the empirical observation that depthwise separable convolutions are a good efficiency-accuracy trade-off. No new physical entities or unproven mathematical axioms are introduced.

free parameters (2)

width multiplier alpha
Global scalar that uniformly reduces channel counts in every layer; chosen by the model builder to meet latency targets.
resolution multiplier rho
Scalar that reduces input image resolution; chosen by the model builder.
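
The effect of the two multipliers on a single layer can be sketched as follows, using the paper's per-layer cost model in which alpha scales the input and output channel counts and rho scales the feature-map size (a consequence of shrinking the input image). The function names and the rounding of scaled sizes to integers are assumptions made for this example.

```python
def standard_mult_adds(dk, m, n, df):
    """Multiply-adds of a standard dk x dk convolution with m input and
    n output channels on a df x df output feature map."""
    return dk * dk * m * n * df * df

def separable_mult_adds(dk, m, n, df, alpha=1.0, rho=1.0):
    """Multiply-adds of one depth-wise separable layer after scaling.

    alpha: width multiplier applied to both channel counts.
    rho: resolution multiplier applied to the feature-map size.
    Scaled sizes are rounded to the nearest integer (an assumption).
    """
    m_s = round(alpha * m)
    n_s = round(alpha * n)
    df_s = round(rho * df)
    depthwise = dk * dk * m_s * df_s * df_s    # per-channel spatial filter
    pointwise = m_s * n_s * df_s * df_s        # 1x1 channel mixing
    return depthwise + pointwise
```

Because the point-wise term dominates, the count shrinks roughly with alpha squared and with rho squared; for example, halving rho on a 14x14 feature map cuts the layer's multiply-adds to a quarter.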

axioms (2)

domain assumption Depthwise separable convolutions preserve sufficient feature quality for the target vision tasks when applied uniformly across layers.
Invoked in section 3 when defining the MobileNet block and when claiming the architecture remains effective after scaling.
domain assumption Standard ImageNet training (SGD, data augmentation, etc.) produces representative accuracy numbers for mobile deployment.
Used throughout the experimental section without additional justification.

pith-pipeline@v0.9.0 · 5436 in / 1479 out tokens · 29210 ms · 2026-05-11T02:45:24.717356+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy."

Foundation.DimensionForcing dimension_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
eess.IV 2025-07 unverdicted novelty 8.0

Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.
LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models
cs.CV 2025-06 accept novelty 8.0

LAION-C supplies six novel corruptions that stay OOD for web-scale training sets and demonstrates that leading models now rival or exceed human robustness on them.
VMamba: Visual State Space Model
cs.CV 2024-01 conditional novelty 8.0

VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
hep-ex 2026-05 unverdicted novelty 7.0

PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration
cs.CV 2026-05 unverdicted novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
cs.CV 2026-05 unverdicted novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
Elastic Spiking Transformers for Efficient Gesture Understanding
cs.NE 2026-05 unverdicted novelty 7.0

A single Elastic Spiking Transformer model dynamically slices network width and attention heads at runtime via granularity-aware weight sharing, matching or exceeding fixed baselines on CIFAR and gesture datasets whil...
KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
cs.CV 2026-04 unverdicted novelty 7.0

KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
quant-ph 2026-04 unverdicted novelty 7.0

Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.
Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
cs.CV 2026-04 accept novelty 7.0

A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.
MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
cs.CV 2026-03 unverdicted novelty 7.0

MobileMold provides 4941 smartphone microscopy images and shows deep learning models reach 99.5% accuracy on mold detection and food classification tasks.
SocialPulse: On-Device Detection of Social Interactions in Naturalistic Settings Using Smartwatch Multimodal Sensing
cs.HC 2026-02 conditional novelty 7.0

SocialPulse presents an on-device smartwatch system for detecting diverse social interactions in naturalistic settings, achieving 77.28% self-report confirmation in a 38-person 900-hour deployment and 90.39% accuracy ...
DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction
cs.CV 2026-01 unverdicted novelty 7.0

DuFal combines global and local high-frequency Fourier neural operators with cross-attention fusion to recover fine anatomical structures in extremely sparse-view CBCT, outperforming prior methods on LUNA16 and ToothF...
DISK: Differentiable Sparse Kernel Complex for Efficient Spatially-Variant Convolution
cs.GR 2025-12 unverdicted novelty 7.0

DISK is a differentiable sparse kernel decomposition method that approximates spatially-variant complex convolutions using optimized sparse samples, initialization for non-convex shapes, and interpolation, achieving h...
FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry
cs.CV 2025-05 unverdicted novelty 7.0

FractalMamba++ scales Vision Mamba across resolutions by using Hilbert fractal serialization, hierarchy-based skip connections, and fractal-aware 2D rotary position encoding.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
cs.LG 2019-05 accept novelty 7.0

EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.
Searching for Activation Functions
cs.NE 2017-10 conditional novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection
cs.LG 2026-05 unverdicted novelty 6.0

VACE learns compact directionally coherent representations for multivariate time series anomaly detection via velocity-consistency training and reports state-of-the-art results on TSB-AD-M.
Low Latency Gaze Tracking via Latent Optical Sensing
cs.CV 2026-05 unverdicted novelty 6.0

A hardware prototype performs gaze estimation by optically encoding task-relevant features with a microlens array and mask, captured on a 4x4 phototransistor array and decoded by a small neural network, reaching 3.4 m...
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
cs.CV 2026-05 unverdicted novelty 6.0

TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS...
GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution
cs.CV 2026-05 unverdicted novelty 6.0

GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 202...
Hardware-Aware Neural Feature Extraction for Resource-Constrained Devices
cs.LG 2026-05 unverdicted novelty 6.0

Gideon is a hardware-aware feature extractor using distillation and DNAS that achieves 111 fps on STM32N6 under 1.5 MB memory with negligible INT8 quantization loss.
EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures
cs.NE 2026-04 unverdicted novelty 6.0

EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.
Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Unified and Generalized Approach
cs.CV 2026-04 unverdicted novelty 6.0

Blind omnidirectional image quality assessment reduces to standard 2D blind IQA by skipping viewport generation, yielding a unified model that accepts equirectangular inputs directly.
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
cs.CV 2026-04 unverdicted novelty 6.0

H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition
cs.AR 2026-04 unverdicted novelty 6.0

A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.
DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
cs.CV 2026-04 unverdicted novelty 6.0

DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4....
CODO: An Automated Compiler for Comprehensive Dataflow Optimization
cs.AR 2026-04 unverdicted novelty 6.0

CODO automates comprehensive dataflow optimization on FPGAs, achieving 1.45x-4.52x speedups on kernels and up to 33.8x on DNN models over state-of-the-art frameworks.
YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks
cs.SD 2026-04 conditional novelty 6.0

YMIR dataset and YMCM CNN achieve 98.8% accuracy classifying five Yemeni music genres from audio features.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers
cs.AI 2026-02 unverdicted novelty 6.0

AHC applies meta-learned hierarchical compression with dual memory banks to enable continual object detection on MCUs under a 100KB budget, backed by a forgetting bound of O(ε√T + 1/√M) and competitive results on CORe...
On-chip probabilistic inference for charged-particle tracking at the sensor edge
physics.ins-det 2026-02 unverdicted novelty 6.0

Neural networks integrated into silicon sensor front-end electronics can regress charged-particle hit positions and angles with calibrated uncertainties from single-layer data while satisfying hardware constraints on ...
Low Cost, High Efficiency: LiDAR Place Recognition in Vineyards with Matryoshka Representation Learning
cs.CV 2026-01 unverdicted novelty 6.0

MinkUNeXt-VINE applies Matryoshka Representation Learning to achieve efficient, high-performing place recognition from sparse LiDAR in vineyards, beating state-of-the-art on two real long-term datasets.
Versatile yet Efficient Network Traffic Analysis: Offloading Network Foundation Model to SmartNIC
cs.NI 2025-08 unverdicted novelty 6.0

Nepco offloads network foundation models to SmartNICs using localized byte-sequence modeling and a pattern-aware convolutional architecture to achieve competitive macro F1 scores with 328x lower end-to-end latency tha...
Variational Autoencoder-Based Black-Box Adversarial Attack on Collaborative DNN Inference
cs.CR 2025-08 unverdicted novelty 6.0

AdVAR-DNN employs a variational autoencoder to create untraceable adversarial samples that compromise black-box collaborative DNN inference by exploiting model partitioning information exchange, achieving high misclas...
Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products
cs.CV 2025-05 unverdicted novelty 6.0

Proposes ACH module with differentiable sampling and softsign normalization for efficient feature expansion, integrated via NAS into Hadaptive-Net to claim SOTA accuracy/speed trade-offs on image classification.
ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network
cs.AR 2025-03 unverdicted novelty 6.0

An 8K super-resolution accelerator using edge-selective dynamic processing achieves 30 FPS with 50% fewer MAC operations and 84% smaller model while keeping PSNR loss under 0.6 dB.
Deep Privacy Funnel Model: From a Discriminative to a Generative Approach with an Application to Face Recognition
cs.LG 2024-04 unverdicted novelty 6.0

Introduces Generative Privacy Funnel (GenPF) and deep variational PF (DVPF) models that extend the privacy funnel to generative settings and provide a controllable privacy-utility trade-off with reduced sensitive attr...
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
cs.CL 2024-02 conditional novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
RECALL: Rehearsal-free Continual Learning for Object Classification
cs.CV 2022-09 unverdicted novelty 6.0

RECALL achieves rehearsal-free continual learning for object classification by logit recall before new training, regression regularization, Mahalanobis loss on known categories, and new heads per sequence, outperformi...
BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
cs.CV 2021-12 conditional novelty 6.0

BEVDet achieves 39.3% mAP and 47.2% NDS on nuScenes val set with a fast BEV-based multi-camera 3D detector that outperforms FCOS3D while using far less compute in its tiny variant.
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
cs.CV 2021-10 unverdicted novelty 6.0

MobileViT is a lightweight vision transformer that reports 78.4% top-1 accuracy on ImageNet-1k with ~6M parameters, outperforming MobileNetv3 by 3.2% and DeIT by 6.2% at similar size, plus gains on MS-COCO detection.
Co-Evolutionary Compression for Unpaired Image Translation
cs.CV 2019-07 unverdicted novelty 6.0

A co-evolutionary compression technique reduces parameters and FLOPs in unpaired image-to-image translation GAN generators while maintaining translation quality on benchmarks.
DeepOrganNet: On-the-Fly Reconstruction and Visualization of 3D / 4D Lung Models from Single-View Projections by Deep Deformation Network
cs.GR 2019-07 unverdicted novelty 6.0

DeepOrganNet reconstructs 3D/4D lung meshes from single-view 2D projections by learning smooth deformation fields from multiple templates via a deep network and trivariate tensor-product deformation.
Open DNN Box by Power Side-Channel Attack
cs.CR 2019-07 unverdicted novelty 6.0

Power side-channel analysis recovers DNN architecture and parameters at 96.5% average accuracy on real embedded devices.
Separable Convolutional LSTMs for Faster Video Segmentation
cs.CV 2019-07 unverdicted novelty 6.0

Separable convLSTMs cut parameters and FLOPs in video segmentation, delivering up to 15% faster GPU inference with similar or slightly lower accuracy.
A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
cs.DC 2019-07 unverdicted novelty 6.0

A unified IR plus ML-based scheduling for CNN inference on multi-vendor integrated GPUs matches or exceeds vendor libraries (up to 1.62x) on image models while supporting more models.
COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning
cs.CV 2019-06 unverdicted novelty 6.0

COP prunes CNN filters using correlation-based importance with global normalization and dual regularization on parameter quantity and FLOPs to enable customized compression.
Transferable 3D Convolutional Neural Networks for Elastic Constants Prediction in Nanoporous Metals
cond-mat.mtrl-sci 2026-05 conditional novelty 5.0

3D CNNs predict elastic moduli of nanoporous metals with R²=0.955, outperforming descriptor-based models, and transfer learning works on smaller denser datasets for large-scale Pareto optimization.
Consistently Informative Soft-Label Temperature for Knowledge Distillation
cs.LG 2026-05 unverdicted novelty 5.0

CIST uses per-sample adaptive temperatures for both teacher and student in knowledge distillation to ensure consistent entropy in soft labels and reports gains on vision and language tasks.
Personalized Face Privacy Protection From a Single Image
cs.CV 2026-05 unverdicted novelty 5.0

FaceCloak learns a lightweight identity-specific cloaking mask from a single image via synthetic face generation and iterative embedding perturbation to evade multiple recognition models.
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
cs.CV 2026-05 unverdicted novelty 5.0

Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
Are Candidate Models Really Needed for Active Learning?
cs.CV 2026-05 unverdicted novelty 5.0

Active learning with randomly initialized models achieves comparable results to traditional candidate-model methods, with low-confidence sampling proving most effective.
TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification
cs.AR 2026-05 unverdicted novelty 5.0

TREA is a low-precision time-multiplexed edge accelerator using dual-precision SIMD MAC units, structured pruning, and reconfigurable activation cores to deliver up to 9x kernel-level latency reduction for object dete...
LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions
eess.SP 2026-05 conditional novelty 5.0

Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...
Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera
cs.CV 2026-04 unverdicted novelty 5.0

A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.
DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.
Towards Topology-Aware Very Large-Scale Photonic AI Accelerators
cs.AR 2026-04 unverdicted novelty 5.0

Photonic accelerators hit a topology-driven Utilization Wall; symmetric grids improve utilization up to 6X and cut memory access over 40% versus linear layouts.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 127 Pith papers · 5 internal anchors

[1]

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

work page 2015
[2]

W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015. 2

work page arXiv 2015
[3]

F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357v2, 2016. 1

work page Pith review arXiv 2016
[4]

Training deep neural networks with low precision multiplications

M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014. 2

work page Pith review arXiv 2014
[5]

S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015. 2

work page internal anchor Pith review arXiv 2015
[6]

J. Hays and A. Efros. IM2GPS: estimating geographic information from a single image. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008. 7

work page 2008
[7]

J. Hays and A. Efros. Large-Scale Image Geolocalization. In J. Choi and G. Friedland, editors, Multimodal Location Estimation of Videos and Images. Springer, 2014. 6, 7

work page 2014
[8]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385,

work page internal anchor Pith review arXiv
[9]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Speed/accuracy trade-offs for modern convolutional object detectors

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 7

work page Pith review arXiv 2016
[11]

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016. 2

work page Pith review arXiv 2016
[12]

F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint arXiv:1602.07360, 2016. 1, 6

[13]


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 1, 3, 7

[14]


M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014. 2

[15]

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 4

[16]

J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014. 1, 3

[17]


A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011. 6

[18]


J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. arXiv preprint arXiv:1511.06789, 2015. 6

[19]


A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 1, 6

[20]


V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014. 2

[21]

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. arXiv preprint arXiv:1512.02325, 2015. 7

[22]


M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016. 1, 2

[23]

S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 7

[24]


O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252,

[25]


F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. 8

[26]

L. Sifre. Rigid-motion scattering for image classification. PhD thesis, 2014. 1, 3

[27]


K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 1, 6

[28]


V. Sindhwani, T. Sainath, and S. Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, pages 3088–3096,

[29]


C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 1

[30]


C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 6

[31]


C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015. 1, 3, 4, 7

[32]


B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 7

[33]


T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2),

[34]

M. Wang, B. Liu, and H. Foroosh. Factorized convolutional neural networks. arXiv preprint arXiv:1608.04337, 2016. 1

[35]


T. Weyand, I. Kostrikov, and J. Philbin. PlaNet - Photo Geolocation with Convolutional Neural Networks. In European Conference on Computer Vision (ECCV), 2016. 6, 7

[36]

J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. arXiv preprint arXiv:1512.06473, 2015. 1

[37]

Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1483, 2015. 1

