CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs

15 Pith papers cite this work. Polarity classification is still indexing.

abstract

Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices performing data analytics right at the source, reducing latency as well as energy consumption for data communication. This paper presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves 4.6X improvement in runtime/throughput and 4.9X improvement in energy efficiency.

citation-role summary

background: 2 · baseline: 1

years

2026: 14 · 2018: 1

representative citing papers

Federated Learning with Non-IID Data

cs.LG · 2018-06-02 · conditional · novelty 6.0

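Non-IID data causes up to 55% accuracy loss in federated learning, which the paper attributes to weight divergence quantified by the earth mover's distance between each device's class distribution and the population distribution; sharing 5% of the data globally improves accuracy by about 30% on CIFAR-10.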

Split CNN Inference on Networked Microcontrollers

cs.DC · 2026-05-10 · unverdicted · novelty 6.0

A fine-grained split inference system enables CNN models infeasible on single MCUs to run across networked devices by partitioning at sub-layer granularity, reducing per-device peak RAM while keeping practical latency.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • Split CNN Inference on Networked Microcontrollers · cs.DC · 2026-05-10 · unverdicted · none · ref 27

    A fine-grained split inference system enables CNN models infeasible on single MCUs to run across networked devices by partitioning at sub-layer granularity, reducing per-device peak RAM while keeping practical latency.

  • Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition · cs.AR · 2026-04-17 · unverdicted · none · ref 7

    A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.