A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
Pith reviewed 2026-05-25 09:23 UTC · model grok-4.3
The pith
A unified intermediate representation lets one optimization pipeline run CNN inference efficiently on integrated GPUs from Intel, ARM, and Nvidia.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single unified IR together with learned scheduling produces inference code that runs at or above the speed of vendor libraries (up to 1.62×) on Intel Graphics, ARM Mali, and Nvidia Maxwell integrated GPUs for popular CNN models, while supporting a wider range of models and allowing new ones to be added without per-vendor rewrites.
What carries the argument
The unified IR that encodes and optimizes vision-specific operators so they can be lowered to multiple GPU architectures and programming interfaces.
If this is right
- Inference code generated once can target Intel, ARM, and Nvidia integrated GPUs at competitive or superior speed.
- Models outside the current set can be added without writing new vendor-specific kernels.
- Operators that do not map well to any GPU fall back to CPU automatically.
- The same pipeline can be used for both image classification and object detection workloads.
Where Pith is reading between the lines
- Edge-device deployments could stop depending on closed vendor libraries for each new GPU generation.
- The approach might be extended to other operator families beyond convolutions if the IR is enlarged.
- Latency and privacy benefits of on-device inference become available on a broader set of low-power platforms.
- A practical test would be to add support for a recently released integrated GPU and re-run the same benchmark suite.
Load-bearing premise
The single IR can capture every important operator and scheduling choice across the different GPUs without adding overhead large enough to erase the measured speedups.
What would settle it
Measure end-to-end latency of the generated code versus the vendor library on a new integrated-GPU architecture; if the new code is consistently slower by more than a few percent on the same models, the claim does not hold.
Figures
read the original abstract
Modern deep learning applications urge to push the model inference taking place at the edge devices for multiple reasons such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user privacy. The Convolutional Neural Network (\emph{CNN}) is one of the most widely used model family in the applications. Given the high computational complexity of the CNN models, it is favorable to execute them on the integrated GPUs at the edge devices, which are ubiquitous and have more power and better energy efficiency than the accompanying CPUs. However, programming on integrated GPUs efficiently is challenging due to the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution to execute CNN model inference on the integrated GPUs at the edge, which uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, as well as leverages machine learning-based scheduling search schemes to optimize computationally-intensive operators like convolution. Our solution even provides a fallback mechanism for operators not suitable or convenient to run on GPUs. The evaluation results suggest that compared to state-of-the-art solutions backed up by the vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar, or even better (up to 1.62$\times$), performance on a number of popular image classification and object detection models. In addition, our solution has a wider model coverage and is more flexible to embrace new models. Our solution has been adopted in production services in AWS and is open-sourced.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an end-to-end solution for CNN model inference on integrated GPUs using a unified IR to represent and optimize vision-specific operators across vendors (Intel, ARM, Nvidia), ML-based scheduling search for compute-intensive kernels such as convolution, and a fallback mechanism for unsuitable operators. It claims performance comparable or superior (up to 1.62×) to vendor high-performance libraries on popular image classification and object detection models, plus wider model coverage and flexibility for new models. The solution is open-sourced and adopted in AWS production services.
Significance. If the empirical results hold, the work offers a practical systems contribution for edge DL inference by addressing architectural and interface diversity without sole reliance on vendor libraries. The combination of unified IR, learned scheduling, and fallback provides flexibility. Explicit credit is due for the open-sourcing and production adoption, which aid reproducibility and real-world applicability in edge computing.
minor comments (1)
- Abstract: the performance claims reference 'a number of popular image classification and object detection models' without naming them or the datasets; adding one sentence with examples would improve immediate context for readers.
Simulated Author's Rebuttal
We thank the referee for the thorough review and positive recommendation to accept the paper. The recognition of the practical contributions, open-sourcing, and production adoption is appreciated.
Circularity Check
No significant circularity
full rationale
The paper is an applied systems description of a unified IR and ML scheduler for CNN inference on integrated GPUs. No equations, parameter fits, predictions, or self-citations appear as load-bearing steps in any derivation chain. Performance results are presented as direct empirical measurements against vendor libraries on standard models, with no reduction to self-defined quantities or imported uniqueness theorems. The central claims rest on implementation details and benchmarking rather than any internal definitional loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Vision-specific operators can be represented and optimized in a single IR across multiple GPU architectures without performance loss
Reference graph
Works this paper leans on
-
[1]
Marti Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, and Antonio González. Early visibility resolution for removing ineffectual computa- tions in the graphics pipeline. In 25th. Int. Symp. on High-Performance Computer Architecture, pages 635–646, 2018
work page 2018
-
[2]
https://www.arm.com/why-arm/technologies/ compute-library
ARM COMPUTE LIBRARY. https://www.arm.com/why-arm/technologies/ compute-library. [Online; accessed 13-Mar-2019]
work page 2019
-
[3]
Neurostream: Scalable and energy efficient deep learning with smart memory cubes
Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. Neurostream: Scalable and energy efficient deep learning with smart memory cubes. IEEE Transactions on Parallel and Distributed Systems , 29:420–434, 2018
work page 2018
-
[4]
Moderngpu: Patterns and behaviors for GPU computing
Sean Baxter. Moderngpu: Patterns and behaviors for GPU computing. http: //moderngpu.github.io/moderngpu, 2013–2016
work page 2013
-
[5]
TVM: An automated end-to-end optimizing compiler for deep learning
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018
work page 2018
-
[6]
Learning to optimize tensor programs
Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31 , pages 3389–3400. Curran Associates, Inc., 2018
work page 2018
-
[7]
cuDNN: Efficient Primitives for Deep Learning
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[8]
Compute Library for Deep Neural Networks (clDNN). https://01.org/cldnn. [Online; accessed 11-Apr-2019]
work page 2019
-
[9]
OpenVINO Toolkit Release Notes
Deanne Deuermeyer and Andrey Z. OpenVINO Toolkit Release Notes. https: //software.intel.com/en-us/articles/OpenVINO-RelNotes. [Online; accessed 3- Jan-2019]
work page 2019
-
[10]
An empirical study of the effect of source-level loop transformations on compiler stability
Zhangxiaowen Gong, Zhi Chen, Justin Szaday, David Wong, Zehra Sura, Nef- tali Watkinson, Saeed Maleki, David Padua, Alexander Veidenbaum, Alexandru Nicolau, and Josep Torrellas. An empirical study of the effect of source-level loop transformations on compiler stability. Proc. ACM Program. Lang. , pages 126:1–126:29, 2018
work page 2018
-
[11]
Energy efficient hpc on embedded socs: Optimization techniques for mali gpu
Ivan Grasso, Petar Radojkovic, Nikola Rajovic, Isaac Gelado, and Alex Ramirez. Energy efficient hpc on embedded socs: Optimization techniques for mali gpu. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS ’14, pages 123–132, 2014
work page 2014
-
[12]
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
Mark Harris, Shubhabrata Sengupta, and John D. Owens. Parallel prefix sum (scan) with CUDA. In Hubert Nguyen, editor, GPU Gems 3 , chapter 39, pages 851–876. Addison Wesley, August 2007
work page 2007
-
[14]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[15]
W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170–1183, December 1986
work page 1986
-
[16]
Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. Fast segmented sort on gpus. In Proceedings of the International Conference on Supercomputing , ICS’17, pages 12:1–12:10, 2017
work page 2017
-
[17]
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Effi- cient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[18]
The OpenCL Specification (Version 2.0, Document Revision 26), October 2014
Lee Howes and Aaftab Munshi. The OpenCL Specification (Version 2.0, Document Revision 26), October 2014. http://www.khronos.org/registry/cl/specs/opencl-2.0. pdf
work page 2014
-
[19]
Deepmon: Mobile gpu- based deep learning framework for continuous vision applications
Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. Deepmon: Mobile gpu- based deep learning framework for continuous vision applications. InProceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 82–95. ACM, 2017
work page 2017
-
[20]
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[21]
A Performance Comparison of CUDA and OpenCL
Kamran Karimi, Neil G Dickson, and Firas Hamze. A performance comparison of cuda and opencl. arXiv preprint arXiv:1005.2581, 2010
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[22]
O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das. Managing gpu concurrency in heterogeneous archi- tectures. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’14), pages 114–126, 2014
work page 2014
-
[23]
Deepx: A software accelerator for low-power deep learning inference on mobile devices
Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. Deepx: A software accelerator for low-power deep learning inference on mobile devices. InProceedings of the 15th International Conference on Information Processing in Sensor Networks , page 23, 2016
work page 2016
-
[24]
Ogleari, Dong Li, and Jishen Zhao
Jiawen Liu, Hengyu Zhao, Matheus A. Ogleari, Dong Li, and Jishen Zhao. Processing-in-memory for energy-efficient neural network training: A hetero- geneous approach. 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 655–668, 2018
work page 2018
-
[25]
Ssd: Single shot multibox detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision , pages 21–37, 2016
work page 2016
-
[26]
Optimizing CNN model inference on CPUs
Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Optimizing CNN model inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, 2019. USENIX Association
work page 2019
-
[27]
Embedded Binarized Neural Networks
Bradley McDanel, Surat Teerapittayanon, and HT Kung. Embedded binarized neural networks. arXiv preprint arXiv:1709.02260, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[28]
Duane Merrill. Cub: Flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming. https://nvlabs.github.io/cub/, 2013–2016
work page 2013
-
[29]
NVIDIA CUDA C programming guide
NVIDIA Corporation. NVIDIA CUDA C programming guide. PG-02829-001_v6.5, August 2014
work page 2014
-
[30]
Owens, Mike Houston, David Luebke, Simon Green, John E
John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE , 96(5):879–899, 2008
work page 2008
-
[31]
T. B. Preußer, G. Gambardella, N. Fraser, and M. Blott. Inference of quantized neural networks on heterogeneous all-programmable devices. In 2018 Design, Automation Test in Europe Conference Exhibition (DATE), pages 833–838, 2018
work page 2018
-
[32]
Programming heterogeneous systems from an image processing dsl
Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan- Kelley, and Mark Horowitz. Programming heterogeneous systems from an image processing dsl. ACM Trans. Archit. Code Optim., 14(3):26:1–26:25, 2017
work page 2017
-
[33]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices, 48(6):519–530, 2013
work page 2013
-
[34]
YOLOv3: An Incremental Improvement
Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[35]
Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIG- GRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH ’07, pages 97–106, 2007
work page 2007
-
[36]
MnasNet: Platform-Aware Neural Architecture Search for Mobile
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[37]
https://developer.nvidia.com/tensorrt
NVIDIA TensorRT. https://developer.nvidia.com/tensorrt. [Online; accessed 11-Apr-2019]
work page 2019
-
[38]
Leyuan Wang, Sean Baxter, and John D. Owens. Fast parallel skew and prefix- doubling suffix array construction on the GPU. Concurrency and Computation: Practice & Experience, 28(12):3466–3484, 2016
work page 2016
-
[39]
C. Wu, D. Brooks, K. Chen, and D. Chen, et al. Machine learning at facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) , pages 331–344, 2019
work page 2019
- [40]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.