The Indirect Convolution Algorithm

Marat Dukhan

arxiv: 1907.02129 · v1 · pith:OWC2RKXKnew · submitted 2019-07-03 · 💻 cs.CV · cs.LG· cs.NE

The Indirect Convolution Algorithm

Marat Dukhan This is my paper

Pith reviewed 2026-05-25 10:00 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.NE

keywords convolutionGEMMim2coldeep learningcomputer visionindirection buffermemory optimizationperformance

0 comments

The pith

The Indirect Convolution algorithm performs GEMM-based convolutions using an indirection buffer of pointers instead of an im2col data transformation, reducing memory overhead in proportion to input channels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning frameworks commonly convert convolutions into matrix multiplications via GEMM calls from BLAS libraries, but kernels larger than 1x1 require an im2col step that copies and rearranges image data into a temporary matrix. The paper presents the Indirect Convolution method, which avoids this copy by building a compact buffer of pointers to the original image rows and adapting the GEMM routine to read through those pointers. This change preserves the performance of highly tuned matrix multiplication while eliminating the extra memory traffic and storage. A reader would care because memory bandwidth limits many convolution workloads in practice, and the savings grow directly with the number of input channels.

Core claim

The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of the modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col in GEM

What carries the argument

Indirection buffer of pointers to image pixel rows, which lets a modified GEMM routine compute convolutions of any kernel size without first copying data into a packed matrix.

If this is right

Memory overhead drops in direct proportion to the number of input channels.
Performance improves by as much as 62 percent versus GEMM-plus-im2col on convolutions that require the transformation.
The same GEMM routine now works for arbitrary kernel sizes, padding, stride, and dilation without extra layout code.
A small slowdown appears only on 1x1 stride-1 convolutions that never needed im2col.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may deliver larger relative gains on memory-bandwidth-limited hardware such as mobile CPUs or embedded processors.
Framework implementers could adopt it selectively for kernel sizes greater than 1x1 while retaining direct GEMM for 1x1 cases.
The pointer-based access pattern could interact with cache prefetchers in ways that vary across CPU microarchitectures.
Combining the indirection buffer with existing tiling or vectorization passes inside BLAS libraries remains an open implementation question.

Load-bearing premise

The overhead of pointer chasing in the indirection buffer stays smaller than the cost of the im2col memory copy and layout change on the target hardware.

What would settle it

A direct timing and memory measurement on the same CPU that shows the indirection version using equal or greater total time and memory than standard im2col for kernels larger than 1x1.

Figures

Figures reproduced from arXiv: 1907.02129 by Marat Dukhan.

**Figure 3.** Figure 3: Performance of the Indirect Convolution algorithm and GEMM-based Algorithm on convolution operators of the ResNet-18 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Performance of the Indirect Convolution algorithm and GEMM-based Algorithm on convolution operators of the SqueezeNet 1.0 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS libraries. Convolutions with 1x1 kernels can be directly represented as a GEMM call, but convolutions with larger kernels require a special memory layout transformation - im2col or im2row - to fit into GEMM interface. The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of our modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col transformations in GEMM-based algorithms. This, however, comes at cost of minor performance reduction on 1x1 stride-1 convolutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows how an indirection buffer of row pointers can let a modified GEMM handle arbitrary convolutions without im2col, but the performance edge still needs numbers that separate pointer cost from the saved copy.

read the letter

The core idea is straightforward: instead of copying image patches into a contiguous matrix for GEMM, keep an array of pointers to the original rows and adjust the GEMM to follow them. This removes the memory traffic that grows with input channels and extends the GEMM path to any kernel size, padding, stride, or dilation. That is the actual novelty relative to the usual direct 1x1 or im2col routes described in the abstract. The approach is practical and targets a real cost inside DL frameworks that rely on BLAS GEMM. Credit is due for spelling out the memory scaling and for noting the small regression on the 1x1 stride-1 case where im2col is cheap anyway. The description is clear enough that an implementer could sketch the change. The main gap is evidence. The abstract states up to 62% speedup on im2col-heavy cases but supplies no tables, no hardware details, and no micro-benchmarks that isolate pointer-chasing latency, cache misses, or lost vectorization against a standard GEMM. The stress-test concern about non-contiguous access overhead is therefore still open; if the full paper contains those measurements it would close the loop, otherwise the central claim rests on unshown data. Minor implementation details such as how prefetching and tiling are preserved in the modified GEMM are also missing from what is visible. This is aimed at engineers who maintain convolution back-ends or tune memory-bound layers rather than theorists. A reader already working on GEMM wrappers or framework kernels would find the pointer-buffer trick worth testing even if the speedups need confirmation. It is solid enough on its own terms to merit referee time; the idea is testable and the literature context is standard, so I would send it out rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Indirect Convolution algorithm as an alternative to standard GEMM-based convolution implementations in deep learning frameworks. Instead of using im2col/im2row to reshape input data for GEMM, it introduces an indirection buffer (array of pointers to image rows) that allows a modified GEMM routine to handle convolutions with arbitrary kernel size, padding, stride, and dilation. The central claims are that this reduces memory overhead proportionally to the number of input channels and yields up to 62% speedup versus GEMM-based methods on convolutions that require im2col, at the cost of minor slowdown on 1x1 stride-1 cases.

Significance. If the performance and memory claims hold after accounting for indirection overhead, the work would provide a practical way to extend highly optimized BLAS GEMM primitives to a wider range of convolution parameters without data copying, which could benefit memory-constrained inference and training in DL frameworks. The approach is a direct, implementable alternative rather than a wholly new primitive.

major comments (2)

[Abstract] Abstract: The performance claim of 'outperforms the GEMM-based algorithm by up to 62%' is presented without any reference to hardware platform, specific convolution parameters (kernel size, channels, batch), baseline implementation, or measured metrics; this is load-bearing because the central contribution is the claimed net speedup after introducing the indirection buffer.
[Abstract / Algorithm] Algorithm description (implied in abstract and methods): No micro-benchmark or analysis isolates the overhead of pointer chasing and non-contiguous memory accesses in the modified GEMM versus a standard contiguous GEMM; this directly tests the weakest assumption that indirection does not offset the avoided im2col copy cost.

minor comments (1)

[Abstract] The abstract states the memory reduction is 'proportionally to the number of input channels' but does not define the exact proportionality or provide the formula relating buffer size to C_in.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on experimental conditions and analysis.

read point-by-point responses

Referee: [Abstract] Abstract: The performance claim of 'outperforms the GEMM-based algorithm by up to 62%' is presented without any reference to hardware platform, specific convolution parameters (kernel size, channels, batch), baseline implementation, or measured metrics; this is load-bearing because the central contribution is the claimed net speedup after introducing the indirection buffer.

Authors: We agree that the abstract would benefit from greater specificity. The 62% figure was measured on Intel Xeon processors for convolutions with kernel sizes >1x1 (e.g., 3x3), input channels 64-512, batch size 1, and stride-1, using a standard im2col+GEMM baseline from a common DL framework. Full parameters and metrics appear in the experimental section. We will revise the abstract to reference the hardware platform and the class of convolutions requiring im2col. revision: yes
Referee: [Abstract / Algorithm] Algorithm description (implied in abstract and methods): No micro-benchmark or analysis isolates the overhead of pointer chasing and non-contiguous memory accesses in the modified GEMM versus a standard contiguous GEMM; this directly tests the weakest assumption that indirection does not offset the avoided im2col copy cost.

Authors: This observation is fair. While aggregate results show net gains from avoiding im2col, an isolated comparison would better quantify the indirection cost. We will add micro-benchmark results in the revised manuscript comparing the modified GEMM (with indirection buffer) to a standard contiguous GEMM on identical matrix sizes, reporting the overhead percentage attributable to pointer chasing and non-contiguous loads. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithm presented as direct alternative without self-referential derivations

full rationale

The paper describes the Indirect Convolution algorithm as an alternative to GEMM+im2col methods by introducing an indirection buffer of pointers. No equations, predictions, fitted parameters, or uniqueness theorems are claimed. No self-citations appear in the abstract or description. The performance claims are empirical comparisons, not reductions to inputs by construction. This matches the default expectation of a non-circular algorithmic paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the invention of the indirection buffer and the domain assumption about GEMM availability and efficiency.

axioms (1)

domain assumption Existence of highly optimized BLAS GEMM libraries that can be modified for indirection.
The algorithm builds on top of GEMM primitives provided by BLAS libraries.

invented entities (1)

indirection buffer no independent evidence
purpose: Buffer of pointers to the start of each row of image pixels to enable indirect access in GEMM.
New concept introduced to enable the algorithm without data reshuffling.

pith-pipeline@v0.9.0 · 5738 in / 1271 out tokens · 48866 ms · 2026-05-25T10:00:14.291283+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 10 internal anchors

[1]

Tensorﬂow: A system for large-scale machine learning

Mart ´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe- mawat, Geoffrey Irving, Michael Isard, et al. Tensorﬂow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementa- tion ({OSDI} 16), pages 265–283, 2016

work page 2016
[2]

Low-memory GEMM-based convolution algorithms for deep neural networks

Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. Low-memory GEMM-based convolu- tion algorithms for deep neural networks. arXiv preprint arXiv:1709.03395, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

High per- formance convolutional neural networks for document pro- cessing

Kumar Chellapilla, Sidd Puri, and Patrice Simard. High per- formance convolutional neural networks for document pro- cessing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006

work page 2006
[4]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 801–818, 2018

work page 2018
[5]

{TVM}: An automated end-to-end optimizing compiler for deep learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. {TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18), pages 578–594, 2018

work page 2018
[6]

cuDNN: Efficient Primitives for Deep Learning

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efﬁcient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[7]

Mec: memory-efﬁcient con- volution for deep neural network

Minsik Cho and Daniel Brand. Mec: memory-efﬁcient con- volution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 815–824. JMLR. org, 2017

work page 2017
[8]

Xception: Deep learning with depthwise separable convolutions

Franc ¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 1251–1258, 2017

work page 2017
[9]

Language modeling with gated convolutional net- works

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional net- works. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 933–941. JMLR. org, 2017

work page 2017
[10]

QNNPACK: open source library for optimized mobile deep learn- ing

Marat Dukhan, Yiming Wu, and Hao Lu. QNNPACK: open source library for optimized mobile deep learn- ing. https://code.fb.com/ml-applications/ qnnpack/. [Online; accessed 8-April-2019]

work page 2019
[11]

Anatomy of high-performance deep learning con- volutions on simd architectures

Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning con- volutions on simd architectures. InSC18: International Con- ference for High Performance Computing, Networking, Stor- age and Analysis, pages 830–841. IEEE, 2018

work page 2018
[12]

Anatomy of high- performance matrix multiplication

Kazushige Goto and Robert A Geijn. Anatomy of high- performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12, 2008

work page 2008
[13]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

work page 2017
[14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[15]

LIBXSMM: accelerating small matrix multi- plications by runtime code generation

Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: accelerating small matrix multi- plications by runtime code generation. In Proceedings of the International Conference for High Performance Comput- ing, Networking, Storage and Analysis, page 84. IEEE Press, 2016

work page 2016
[16]

Squeeze-and-excitation net- works

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net- works. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

work page 2018
[17]

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer pa- rameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[18]

Caffe: Convolutional architecture for fast feature embedding

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM inter- national conference on Multimedia , pages 675–678. ACM, 2014

work page 2014
[19]

Pooling Pyramid Network for Object Detection

Pengchong Jin, Vivek Rathod, and Xiangxin Zhu. Pool- ing pyramid network for object detection. arXiv preprint arXiv:1807.03284, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Panoptic Feature Pyramid Networks

Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Doll´ar. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[21]

Fast algorithms for convo- lutional neural networks

Andrew Lavin and Scott Gray. Fast algorithms for convo- lutional neural networks. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016

work page 2016
[22]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Pro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

work page 2017
[23]

Shufﬂenet v2: Practical guidelines for efﬁcient CNN archi- tecture design

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufﬂenet v2: Practical guidelines for efﬁcient CNN archi- tecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 9

work page 2018
[24]

Automatic differentiation in PyTorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al- ban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017

work page 2017
[25]

YOLOv3: An Incremental Improvement

Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 4510–4520, 2018

work page 2018
[27]

Wavenet: A generative model for raw audio

A ¨aron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016

work page 2016
[28]

Blis: A frame- work for rapidly instantiating BLAS functionality

Field G Van Zee and Robert A Van De Geijn. Blis: A frame- work for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS) , 41(3):14, 2015

work page 2015
[29]

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolu- tional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[30]

Rooﬂine: An insightful visual performance model for ﬂoating-point programs and multicore architectures

Samuel Williams, Andrew Waterman, and David Patter- son. Rooﬂine: An insightful visual performance model for ﬂoating-point programs and multicore architectures. Techni- cal report, Lawrence Berkeley National Lab.(LBNL), Berke- ley, CA (United States), 2009

work page 2009
[31]

FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efﬁcient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Pay Less Attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dy- namic convolutions. arXiv preprint arXiv:1901.10430, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[33]

Xiaomi. MACE. https://github.com/XiaoMi/ mace. [Online; accessed 8-April-2019]

work page 2019
[34]

Aggregated residual transformations for deep neural networks

Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017

work page 2017
[35]

High Performance Zero-Memory Overhead Direct Convolutions

Jiyuan Zhang, Franz Franchetti, and Tze Meng Low. High performance zero-memory overhead direct convolutions. arXiv preprint arXiv:1809.10170, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

Learning transferable architectures for scalable image recognition

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018. 10

work page 2018

[1] [1]

Tensorﬂow: A system for large-scale machine learning

Mart ´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe- mawat, Geoffrey Irving, Michael Isard, et al. Tensorﬂow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementa- tion ({OSDI} 16), pages 265–283, 2016

work page 2016

[2] [2]

Low-memory GEMM-based convolution algorithms for deep neural networks

Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. Low-memory GEMM-based convolu- tion algorithms for deep neural networks. arXiv preprint arXiv:1709.03395, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

High per- formance convolutional neural networks for document pro- cessing

Kumar Chellapilla, Sidd Puri, and Patrice Simard. High per- formance convolutional neural networks for document pro- cessing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006

work page 2006

[4] [4]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 801–818, 2018

work page 2018

[5] [5]

{TVM}: An automated end-to-end optimizing compiler for deep learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. {TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18), pages 578–594, 2018

work page 2018

[6] [6]

cuDNN: Efficient Primitives for Deep Learning

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efﬁcient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[7] [7]

Mec: memory-efﬁcient con- volution for deep neural network

Minsik Cho and Daniel Brand. Mec: memory-efﬁcient con- volution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 815–824. JMLR. org, 2017

work page 2017

[8] [8]

Xception: Deep learning with depthwise separable convolutions

Franc ¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 1251–1258, 2017

work page 2017

[9] [9]

Language modeling with gated convolutional net- works

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional net- works. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 933–941. JMLR. org, 2017

work page 2017

[10] [10]

QNNPACK: open source library for optimized mobile deep learn- ing

Marat Dukhan, Yiming Wu, and Hao Lu. QNNPACK: open source library for optimized mobile deep learn- ing. https://code.fb.com/ml-applications/ qnnpack/. [Online; accessed 8-April-2019]

work page 2019

[11] [11]

Anatomy of high-performance deep learning con- volutions on simd architectures

Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning con- volutions on simd architectures. InSC18: International Con- ference for High Performance Computing, Networking, Stor- age and Analysis, pages 830–841. IEEE, 2018

work page 2018

[12] [12]

Anatomy of high- performance matrix multiplication

Kazushige Goto and Robert A Geijn. Anatomy of high- performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12, 2008

work page 2008

[13] [13]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

work page 2017

[14] [14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[15] [15]

LIBXSMM: accelerating small matrix multi- plications by runtime code generation

Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: accelerating small matrix multi- plications by runtime code generation. In Proceedings of the International Conference for High Performance Comput- ing, Networking, Storage and Analysis, page 84. IEEE Press, 2016

work page 2016

[16] [16]

Squeeze-and-excitation net- works

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net- works. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

work page 2018

[17] [17]

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer pa- rameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[18] [18]

Caffe: Convolutional architecture for fast feature embedding

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM inter- national conference on Multimedia , pages 675–678. ACM, 2014

work page 2014

[19] [19]

Pooling Pyramid Network for Object Detection

Pengchong Jin, Vivek Rathod, and Xiangxin Zhu. Pool- ing pyramid network for object detection. arXiv preprint arXiv:1807.03284, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Panoptic Feature Pyramid Networks

Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Doll´ar. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[21] [21]

Fast algorithms for convo- lutional neural networks

Andrew Lavin and Scott Gray. Fast algorithms for convo- lutional neural networks. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016

work page 2016

[22] [22]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Pro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

work page 2017

[23] [23]

Shufﬂenet v2: Practical guidelines for efﬁcient CNN archi- tecture design

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufﬂenet v2: Practical guidelines for efﬁcient CNN archi- tecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 9

work page 2018

[24] [24]

Automatic differentiation in PyTorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al- ban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017

work page 2017

[25] [25]

YOLOv3: An Incremental Improvement

Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[26] [26]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 4510–4520, 2018

work page 2018

[27] [27]

Wavenet: A generative model for raw audio

A ¨aron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016

work page 2016

[28] [28]

Blis: A frame- work for rapidly instantiating BLAS functionality

Field G Van Zee and Robert A Van De Geijn. Blis: A frame- work for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS) , 41(3):14, 2015

work page 2015

[29] [29]

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolu- tional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[30] [30]

Rooﬂine: An insightful visual performance model for ﬂoating-point programs and multicore architectures

Samuel Williams, Andrew Waterman, and David Patter- son. Rooﬂine: An insightful visual performance model for ﬂoating-point programs and multicore architectures. Techni- cal report, Lawrence Berkeley National Lab.(LBNL), Berke- ley, CA (United States), 2009

work page 2009

[31] [31]

FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efﬁcient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

Pay Less Attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dy- namic convolutions. arXiv preprint arXiv:1901.10430, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[33] [33]

Xiaomi. MACE. https://github.com/XiaoMi/ mace. [Online; accessed 8-April-2019]

work page 2019

[34] [34]

Aggregated residual transformations for deep neural networks

Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017

work page 2017

[35] [35]

High Performance Zero-Memory Overhead Direct Convolutions

Jiyuan Zhang, Franz Franchetti, and Tze Meng Low. High performance zero-memory overhead direct convolutions. arXiv preprint arXiv:1809.10170, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[36] [36]

Learning transferable architectures for scalable image recognition

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018. 10

work page 2018