A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Leyuan Wang; Lianmin Zheng; Mu Li; Yao Wang; Yida Wang; Yizhi Liu; Zhi Chen

arxiv: 1907.02154 · v1 · pith:VQ3YVWMGnew · submitted 2019-07-03 · 💻 cs.DC

Pith reviewed 2026-05-25 09:23 UTC · model grok-4.3

classification 💻 cs.DC

keywords CNN inference · integrated GPU · unified IR · operator optimization · edge devices · convolution scheduling · machine learning search · vendor library comparison

The pith

A unified intermediate representation lets one optimization pipeline run CNN inference efficiently on integrated GPUs from Intel, ARM, and Nvidia.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an end-to-end system that executes CNN inference on the integrated GPUs common in edge devices. It encodes vision operators in a single intermediate representation that can be lowered to the different programming interfaces of multiple GPU vendors, then applies machine-learning search to choose schedules for heavy kernels such as convolution. A CPU fallback path covers any operator that is inconvenient on the GPU. The authors report that the resulting code matches or exceeds the speed of each vendor’s own high-performance library on standard image-classification and detection networks while covering a larger set of models. The central goal is to remove the need to rewrite or retune models for every new integrated-GPU platform.
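The schedule-selection step can be pictured with a toy example. The sketch below is not the paper's learned search; it swaps in an exhaustive scan over tile sizes scored by a hypothetical analytic cost (input elements loaded per output element). The names `candidate_schedules`, `toy_cost`, and `search_schedule`, the tile options, and the `max_tile_elems` limit are all assumptions made for illustration.

```python
from itertools import product


def candidate_schedules(height, width, tile_options=(1, 2, 4, 8, 16)):
    """Yield (tile_h, tile_w) pairs that evenly divide the output plane."""
    for tile_h, tile_w in product(tile_options, repeat=2):
        if height % tile_h == 0 and width % tile_w == 0:
            yield tile_h, tile_w


def toy_cost(tile_h, tile_w, kernel=3, max_tile_elems=64):
    """Hypothetical cost: input elements loaded per output element.

    A tile of outputs needs a (tile + kernel - 1) input window, so larger
    tiles amortize the halo. Tiles above max_tile_elems are rejected as a
    stand-in for a register or local-memory limit (an assumption).
    """
    if tile_h * tile_w > max_tile_elems:
        return float("inf")
    loaded = (tile_h + kernel - 1) * (tile_w + kernel - 1)
    return loaded / (tile_h * tile_w)


def search_schedule(height, width):
    """Pick the candidate tiling with the lowest toy cost."""
    return min(candidate_schedules(height, width), key=lambda s: toy_cost(*s))


best = search_schedule(56, 56)
print(best)
```

A real search of the kind the paper describes would replace `toy_cost` with measurements on the target GPU or a model trained on such measurements, and would cover a much larger schedule space than two tile sizes.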

Core claim

A single unified IR together with learned scheduling produces inference code that runs at or above the speed of vendor libraries (up to 1.62×) on Intel Graphics, ARM Mali, and Nvidia Maxwell integrated GPUs for popular CNN models, while supporting a wider range of models and allowing new ones to be added without per-vendor rewrites.

What carries the argument

The unified IR that encodes and optimizes vision-specific operators so they can be lowered to multiple GPU architectures and programming interfaces.

If this is right

Inference code generated once can target Intel, ARM, and Nvidia integrated GPUs at competitive or superior speed.
Models outside the current set can be added without writing new vendor-specific kernels.
Operators that do not map well to any GPU fall back to CPU automatically.
The same pipeline can be used for both image classification and object detection workloads.
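The CPU fallback in the list above can be sketched as a simple placement pass. This is a minimal illustration under stated assumptions, not the paper's mechanism: the operator names, the `gpu_supported` set, and the linear-chain model of the network are all hypothetical.

```python
def place_operators(ops, gpu_supported):
    """Return (op, device) pairs; ops missing from gpu_supported go to the CPU."""
    return [(op, "gpu" if op in gpu_supported else "cpu") for op in ops]


def count_transfers(placement):
    """Count device switches between consecutive operators in a linear chain."""
    devices = [device for _, device in placement]
    return sum(1 for prev, cur in zip(devices, devices[1:]) if prev != cur)


# Hypothetical detection-style chain: the last two ops lack a GPU kernel.
ops = ["conv2d", "relu", "conv2d", "box_decode", "nms"]
gpu_supported = {"conv2d", "relu"}

placement = place_operators(ops, gpu_supported)
print(placement)
print(count_transfers(placement))
```

Counting transfers matters because each switch between devices implies a data copy or synchronization, so a fallback is cheapest when the CPU-only operators sit together at one end of the graph.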

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Edge-device deployments could stop depending on closed vendor libraries for each new GPU generation.
The approach might be extended to other operator families beyond convolutions if the IR is enlarged.
Latency and privacy benefits of on-device inference become available on a broader set of low-power platforms.
A practical test would be to add support for a recently released integrated GPU and re-run the same benchmark suite.

Load-bearing premise

The single IR can capture every important operator and scheduling choice across the different GPUs without adding overhead large enough to erase the measured speedups.

What would settle it

Measure end-to-end latency of the generated code versus the vendor library on a new integrated-GPU architecture; if the new code is consistently slower by more than a few percent on the same models, the claim does not hold.
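The test described above reduces to a small comparison once latencies are in hand. The sketch below assumes that "a few percent" means 3% and that "consistently" means on every model measured; both readings, the function names, and the latency figures are placeholders, not results from the paper.

```python
def slower_by_more_than(vendor_ms, ours_ms, tolerance):
    """True if the generated code exceeds vendor latency by more than tolerance."""
    return ours_ms > vendor_ms * (1.0 + tolerance)


def claim_refuted(latencies, tolerance=0.03):
    """latencies maps model name -> (vendor_ms, ours_ms).

    The claim counts as refuted only if every model is slower than the
    vendor library by more than the tolerance (one reading of "consistently").
    """
    return all(
        slower_by_more_than(vendor_ms, ours_ms, tolerance)
        for vendor_ms, ours_ms in latencies.values()
    )


# Made-up latencies in milliseconds, for illustration only.
all_slower = {"model_a": (10.0, 10.5), "model_b": (20.0, 21.0)}
mixed = {"model_a": (10.0, 10.5), "model_b": (20.0, 19.0)}

print(claim_refuted(all_slower))
print(claim_refuted(mixed))
```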

Figures

Figures reproduced from arXiv: 1907.02154 by Leyuan Wang, Lianmin Zheng, Mu Li, Yao Wang, Yida Wang, Yizhi Liu, Zhi Chen.

**Figure 2.** Note that although our implementation uses similar ideas …

**Figure 3.** Prefix sum (Scan) pipeline example. Suppose we …
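For readers unfamiliar with the operation named in the Figure 3 caption, here is a minimal sequential sketch of an inclusive prefix sum in the Hillis-Steele style. It illustrates the scan operation only; it is not a reconstruction of the pipeline in the figure, and the function name is an assumption.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum.

    Runs about log2(n) passes. In each pass, every element at index i >= offset
    adds the element offset positions to its left, then offset doubles.
    On a GPU the inner loop would run as parallel threads; here it is serial.
    """
    result = list(values)
    offset = 1
    while offset < len(result):
        previous = result[:]  # snapshot so each pass reads the prior pass's values
        for i in range(offset, len(result)):
            result[i] = previous[i] + previous[i - offset]
        offset *= 2
    return result


print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```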

Original abstract

Modern deep learning applications urge to push the model inference taking place at the edge devices for multiple reasons such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user privacy. The Convolutional Neural Network (CNN) is one of the most widely used model family in the applications. Given the high computational complexity of the CNN models, it is favorable to execute them on the integrated GPUs at the edge devices, which are ubiquitous and have more power and better energy efficiency than the accompanying CPUs. However, programming on integrated GPUs efficiently is challenging due to the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution to execute CNN model inference on the integrated GPUs at the edge, which uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, as well as leverages machine learning-based scheduling search schemes to optimize computationally-intensive operators like convolution. Our solution even provides a fallback mechanism for operators not suitable or convenient to run on GPUs. The evaluation results suggest that compared to state-of-the-art solutions backed up by the vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar, or even better (up to 1.62×), performance on a number of popular image classification and object detection models. In addition, our solution has a wider model coverage and is more flexible to embrace new models. Our solution has been adopted in production services in AWS and is open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper ships a working cross-vendor system for CNN inference on integrated GPUs that matches or beats vendor libraries on standard models while adding flexibility and production use.

The main takeaway is a practical engineering system that runs CNN inference on Intel, ARM, and Nvidia integrated GPUs without locking into any single vendor stack. It uses one IR to represent vision operators, applies ML search to schedule compute-heavy kernels such as convolution, and falls back to the CPU when needed. The reported results show parity with, or gains of up to 1.62× over, vendor libraries on image classification and detection models, plus wider model coverage than the baselines. The work is already open-sourced and running in AWS production services, which provides some external validation of the claims.

Referee Report

0 major / 1 minor

Summary. The paper presents an end-to-end solution for CNN model inference on integrated GPUs using a unified IR to represent and optimize vision-specific operators across vendors (Intel, ARM, Nvidia), ML-based scheduling search for compute-intensive kernels such as convolution, and a fallback mechanism for unsuitable operators. It claims performance comparable or superior (up to 1.62×) to vendor high-performance libraries on popular image classification and object detection models, plus wider model coverage and flexibility for new models. The solution is open-sourced and adopted in AWS production services.

Significance. If the empirical results hold, the work offers a practical systems contribution for edge DL inference by addressing architectural and interface diversity without sole reliance on vendor libraries. The combination of unified IR, learned scheduling, and fallback provides flexibility. Explicit credit is due for the open-sourcing and production adoption, which aid reproducibility and real-world applicability in edge computing.

minor comments (1)

Abstract: the performance claims reference 'a number of popular image classification and object detection models' without naming them or the datasets; adding one sentence with examples would improve immediate context for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation to accept the paper. The recognition of the practical contributions, open-sourcing, and production adoption is appreciated.

Circularity Check

0 steps flagged

No significant circularity

The paper is an applied systems description of a unified IR and ML scheduler for CNN inference on integrated GPUs. No equations, parameter fits, predictions, or self-citations appear as load-bearing steps in any derivation chain. Performance results are presented as direct empirical measurements against vendor libraries on standard models, with no reduction to self-defined quantities or imported uniqueness theorems. The central claims rest on implementation details and benchmarking rather than any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly rests on the domain assumption that vision operators admit a lossless unified representation across GPU vendors.

axioms (1)

domain assumption: Vision-specific operators can be represented and optimized in a single IR across multiple GPU architectures without performance loss.
Invoked in the description of the unified IR and multi-vendor support.

