arxiv: 1907.02154 · v1 · pith:VQ3YVWMGnew · submitted 2019-07-03 · 💻 cs.DC

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Pith reviewed 2026-05-25 09:23 UTC · model grok-4.3

classification 💻 cs.DC
keywords CNN inference · integrated GPU · unified IR · operator optimization · edge devices · convolution scheduling · machine learning search · vendor library comparison

The pith

A unified intermediate representation lets one optimization pipeline run CNN inference efficiently on integrated GPUs from Intel, ARM, and Nvidia.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an end-to-end system that executes CNN inference on the integrated GPUs common in edge devices. It encodes vision operators in a single intermediate representation that can be lowered to the different programming interfaces of multiple GPU vendors, then applies machine-learning search to choose schedules for heavy kernels such as convolution. A CPU fallback path covers any operator that is inconvenient on the GPU. The authors report that the resulting code matches or exceeds the speed of each vendor’s own high-performance library on standard image-classification and detection networks while covering a larger set of models. The central goal is to remove the need to rewrite or retune models for every new integrated-GPU platform.
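The schedule-search idea in the paragraph above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the knob names (`tile_h`, `tile_w`, `unroll`), the synthetic `measure` function, and the nearest-neighbor surrogate in `predict` are all hypothetical stand-ins chosen so the example runs without any GPU or compiler.

```python
import itertools
import random


def candidate_schedules():
    """Enumerate a toy schedule space of (tile_h, tile_w, unroll) knobs."""
    return list(itertools.product([1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4]))


def measure(schedule):
    """Hypothetical stand-in for timing a compiled kernel on the device.

    Returns a synthetic latency in ms whose minimum sits at (4, 4, 2).
    """
    tile_h, tile_w, unroll = schedule
    return 1.0 + abs(tile_h - 4) * 0.3 + abs(tile_w - 4) * 0.2 + abs(unroll - 2) * 0.1


def predict(schedule, history):
    """Surrogate cost model (assumption): latency of the nearest measured schedule."""
    nearest = min(
        history,
        key=lambda item: sum(abs(a - b) for a, b in zip(item[0], schedule)),
    )
    return nearest[1]


def search(rounds=4, batch=4, seed=0):
    """Alternate between ranking unmeasured schedules and measuring the top few."""
    rng = random.Random(seed)
    space = candidate_schedules()
    # Seed the history with a few random measurements.
    history = [(s, measure(s)) for s in rng.sample(space, batch)]
    for _ in range(rounds):
        measured = {s for s, _ in history}
        remaining = [s for s in space if s not in measured]
        if not remaining:
            break
        # Measure only the candidates the surrogate ranks as most promising.
        remaining.sort(key=lambda s: predict(s, history))
        history.extend((s, measure(s)) for s in remaining[:batch])
    # Return the (schedule, latency) pair with the lowest measured latency.
    return min(history, key=lambda item: item[1])
```

The design point the sketch tries to convey is that only a small batch of candidates is actually measured per round, while a cheap model ranks the rest. A real system would replace `measure` with on-device timing and `predict` with a trained cost model.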

Core claim

A single unified IR together with learned scheduling produces inference code that runs at or above the speed of vendor libraries (up to 1.62×) on Intel Graphics, ARM Mali, and Nvidia Maxwell integrated GPUs for popular CNN models, while supporting a wider range of models and allowing new ones to be added without per-vendor rewrites.

What carries the argument

The unified IR that encodes and optimizes vision-specific operators so they can be lowered to multiple GPU architectures and programming interfaces.

If this is right

  • Inference code generated once can target Intel, ARM, and Nvidia integrated GPUs at competitive or superior speed.
  • Models outside the current set can be added without writing new vendor-specific kernels.
  • Operators that do not map well to any GPU fall back to CPU automatically.
  • The same pipeline can be used for both image classification and object detection workloads.
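The CPU fallback mentioned in the list above can be pictured as a simple placement pass over a model's operators. The following is a hedged sketch only: the operator names, the `GPU_SUPPORTED` set, and the helper functions are hypothetical and are not taken from the paper or from any library.

```python
# Hypothetical set of operators assumed to have a good GPU implementation.
GPU_SUPPORTED = {"conv2d", "relu", "pool", "dense"}


def place(ops, gpu_supported=GPU_SUPPORTED):
    """Assign each operator to 'gpu' when supported, otherwise to 'cpu'."""
    return [(op, "gpu" if op in gpu_supported else "cpu") for op in ops]


def count_transfers(placement):
    """Count device boundaries; each one implies a hand-off between devices."""
    devices = [device for _, device in placement]
    return sum(1 for a, b in zip(devices, devices[1:]) if a != b)


if __name__ == "__main__":
    # 'argsort' is used here as an example of an operator outside the GPU set.
    model_ops = ["conv2d", "relu", "argsort", "dense"]
    placement = place(model_ops)
    print(placement)
    print("device hand-offs:", count_transfers(placement))
```

Counting hand-offs is included because every switch between devices adds coordination cost, so a placement with few boundaries is generally preferable to one that alternates often.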

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge-device deployments could stop depending on closed vendor libraries for each new GPU generation.
  • The approach might be extended to other operator families beyond convolutions if the IR is enlarged.
  • Latency and privacy benefits of on-device inference become available on a broader set of low-power platforms.
  • A practical test would be to add support for a recently released integrated GPU and re-run the same benchmark suite.

Load-bearing premise

The single IR can capture every important operator and scheduling choice across the different GPUs without adding overhead large enough to erase the measured speedups.

What would settle it

Measure end-to-end latency of the generated code versus the vendor library on a new integrated-GPU architecture; if the new code is consistently slower by more than a few percent on the same models, the claim does not hold.
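The test described above can be written down as a small check. This is a sketch under assumptions: "a few percent" is taken to be a 5% tolerance, "consistently" is read as "on every benchmarked model", and the function names and input layout are hypothetical.

```python
def slowdown(generated_ms, vendor_ms):
    """Relative slowdown of the generated code versus the vendor library."""
    return generated_ms / vendor_ms - 1.0


def claim_holds(results, tolerance=0.05):
    """Return False only if every model is slower than the vendor beyond tolerance.

    results maps a model name to (generated_ms, vendor_ms) end-to-end latencies.
    """
    if not results:
        raise ValueError("no measurements supplied")
    return not all(slowdown(g, v) > tolerance for g, v in results.values())
```

For instance, `claim_holds({"a": (10.0, 10.0), "b": (12.0, 10.0)})` returns `True` because model `a` is not slower, while `claim_holds({"a": (11.0, 10.0), "b": (12.0, 10.0)})` returns `False` because both models exceed the tolerance.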

Figures

Figures reproduced from arXiv: 1907.02154 by Leyuan Wang, Lianmin Zheng, Mu Li, Yao Wang, Yida Wang, Yizhi Liu, Zhi Chen.

Figure 1: Overview of our working pipeline. Note that the …
Figure 2: Note that although our implementation uses similar ideas …
Figure 3: Prefix sum (Scan) pipeline example. Suppose we …
Original abstract

Modern deep learning applications urge to push the model inference taking place at the edge devices for multiple reasons such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user privacy. The Convolutional Neural Network (CNN) is one of the most widely used model family in the applications. Given the high computational complexity of the CNN models, it is favorable to execute them on the integrated GPUs at the edge devices, which are ubiquitous and have more power and better energy efficiency than the accompanying CPUs. However, programming on integrated GPUs efficiently is challenging due to the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution to execute CNN model inference on the integrated GPUs at the edge, which uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, as well as leverages machine learning-based scheduling search schemes to optimize computationally-intensive operators like convolution. Our solution even provides a fallback mechanism for operators not suitable or convenient to run on GPUs. The evaluation results suggest that compared to state-of-the-art solutions backed up by the vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar, or even better (up to 1.62×), performance on a number of popular image classification and object detection models. In addition, our solution has a wider model coverage and is more flexible to embrace new models. Our solution has been adopted in production services in AWS and is open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper presents an end-to-end solution for CNN model inference on integrated GPUs using a unified IR to represent and optimize vision-specific operators across vendors (Intel, ARM, Nvidia), ML-based scheduling search for compute-intensive kernels such as convolution, and a fallback mechanism for unsuitable operators. It claims performance comparable or superior (up to 1.62×) to vendor high-performance libraries on popular image classification and object detection models, plus wider model coverage and flexibility for new models. The solution is open-sourced and adopted in AWS production services.

Significance. If the empirical results hold, the work offers a practical systems contribution for edge DL inference by addressing architectural and interface diversity without sole reliance on vendor libraries. The combination of unified IR, learned scheduling, and fallback provides flexibility. Explicit credit is due for the open-sourcing and production adoption, which aid reproducibility and real-world applicability in edge computing.

minor comments (1)
  1. Abstract: the performance claims reference 'a number of popular image classification and object detection models' without naming them or the datasets; adding one sentence with examples would improve immediate context for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation to accept the paper. The recognition of the practical contributions, open-sourcing, and production adoption is appreciated.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an applied systems description of a unified IR and ML scheduler for CNN inference on integrated GPUs. No equations, parameter fits, predictions, or self-citations appear as load-bearing steps in any derivation chain. Performance results are presented as direct empirical measurements against vendor libraries on standard models, with no reduction to self-defined quantities or imported uniqueness theorems. The central claims rest on implementation details and benchmarking rather than any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly rests on the domain assumption that vision operators admit a lossless unified representation across GPU vendors.

axioms (1)
  • domain assumption: Vision-specific operators can be represented and optimized in a single IR across multiple GPU architectures without performance loss.
    Invoked in the description of the unified IR and multi-vendor support.

pith-pipeline@v0.9.0 · 5817 in / 1250 out tokens · 30975 ms · 2026-05-25T09:23:12.188452+00:00 · methodology

