The Indirect Convolution Algorithm
Pith reviewed 2026-05-25 10:00 UTC · model grok-4.3
The pith
The Indirect Convolution algorithm performs GEMM-based convolutions using an indirection buffer of pointers instead of an im2col data transformation, reducing memory overhead in proportion to input channels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of the modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col in GEM
What carries the argument
Indirection buffer of pointers to image pixel rows, which lets a modified GEMM routine compute convolutions of any kernel size without first copying data into a packed matrix.
If this is right
- Memory overhead drops in direct proportion to the number of input channels.
- Performance improves by as much as 62 percent versus GEMM-plus-im2col on convolutions that require the transformation.
- The same GEMM routine now works for arbitrary kernel sizes, padding, stride, and dilation without extra layout code.
- A small slowdown appears only on 1x1 stride-1 convolutions that never needed im2col.
Where Pith is reading between the lines
- The method may deliver larger relative gains on memory-bandwidth-limited hardware such as mobile CPUs or embedded processors.
- Framework implementers could adopt it selectively for kernel sizes greater than 1x1 while retaining direct GEMM for 1x1 cases.
- The pointer-based access pattern could interact with cache prefetchers in ways that vary across CPU microarchitectures.
- Combining the indirection buffer with existing tiling or vectorization passes inside BLAS libraries remains an open implementation question.
Load-bearing premise
The overhead of pointer chasing in the indirection buffer stays smaller than the cost of the im2col memory copy and layout change on the target hardware.
What would settle it
A direct timing and memory measurement on the same CPU that shows the indirection version using equal or greater total time and memory than standard im2col for kernels larger than 1x1.
Figures
read the original abstract
Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS libraries. Convolutions with 1x1 kernels can be directly represented as a GEMM call, but convolutions with larger kernels require a special memory layout transformation - im2col or im2row - to fit into GEMM interface. The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of our modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col transformations in GEMM-based algorithms. This, however, comes at cost of minor performance reduction on 1x1 stride-1 convolutions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Indirect Convolution algorithm as an alternative to standard GEMM-based convolution implementations in deep learning frameworks. Instead of using im2col/im2row to reshape input data for GEMM, it introduces an indirection buffer (array of pointers to image rows) that allows a modified GEMM routine to handle convolutions with arbitrary kernel size, padding, stride, and dilation. The central claims are that this reduces memory overhead proportionally to the number of input channels and yields up to 62% speedup versus GEMM-based methods on convolutions that require im2col, at the cost of minor slowdown on 1x1 stride-1 cases.
Significance. If the performance and memory claims hold after accounting for indirection overhead, the work would provide a practical way to extend highly optimized BLAS GEMM primitives to a wider range of convolution parameters without data copying, which could benefit memory-constrained inference and training in DL frameworks. The approach is a direct, implementable alternative rather than a wholly new primitive.
major comments (2)
- [Abstract] Abstract: The performance claim of 'outperforms the GEMM-based algorithm by up to 62%' is presented without any reference to hardware platform, specific convolution parameters (kernel size, channels, batch), baseline implementation, or measured metrics; this is load-bearing because the central contribution is the claimed net speedup after introducing the indirection buffer.
- [Abstract / Algorithm] Algorithm description (implied in abstract and methods): No micro-benchmark or analysis isolates the overhead of pointer chasing and non-contiguous memory accesses in the modified GEMM versus a standard contiguous GEMM; this directly tests the weakest assumption that indirection does not offset the avoided im2col copy cost.
minor comments (1)
- [Abstract] The abstract states the memory reduction is 'proportionally to the number of input channels' but does not define the exact proportionality or provide the formula relating buffer size to C_in.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on experimental conditions and analysis.
read point-by-point responses
-
Referee: [Abstract] Abstract: The performance claim of 'outperforms the GEMM-based algorithm by up to 62%' is presented without any reference to hardware platform, specific convolution parameters (kernel size, channels, batch), baseline implementation, or measured metrics; this is load-bearing because the central contribution is the claimed net speedup after introducing the indirection buffer.
Authors: We agree that the abstract would benefit from greater specificity. The 62% figure was measured on Intel Xeon processors for convolutions with kernel sizes >1x1 (e.g., 3x3), input channels 64-512, batch size 1, and stride-1, using a standard im2col+GEMM baseline from a common DL framework. Full parameters and metrics appear in the experimental section. We will revise the abstract to reference the hardware platform and the class of convolutions requiring im2col. revision: yes
-
Referee: [Abstract / Algorithm] Algorithm description (implied in abstract and methods): No micro-benchmark or analysis isolates the overhead of pointer chasing and non-contiguous memory accesses in the modified GEMM versus a standard contiguous GEMM; this directly tests the weakest assumption that indirection does not offset the avoided im2col copy cost.
Authors: This observation is fair. While aggregate results show net gains from avoiding im2col, an isolated comparison would better quantify the indirection cost. We will add micro-benchmark results in the revised manuscript comparing the modified GEMM (with indirection buffer) to a standard contiguous GEMM on identical matrix sizes, reporting the overhead percentage attributable to pointer chasing and non-contiguous loads. revision: yes
Circularity Check
No circularity: algorithm presented as direct alternative without self-referential derivations
full rationale
The paper describes the Indirect Convolution algorithm as an alternative to GEMM+im2col methods by introducing an indirection buffer of pointers. No equations, predictions, fitted parameters, or uniqueness theorems are claimed. No self-citations appear in the abstract or description. The performance claims are empirical comparisons, not reductions to inputs by construction. This matches the default expectation of a non-circular algorithmic paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existence of highly optimized BLAS GEMM libraries that can be modified for indirection.
invented entities (1)
-
indirection buffer
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Tensorflow: A system for large-scale machine learning
Mart ´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe- mawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementa- tion ({OSDI} 16), pages 265–283, 2016
work page 2016
-
[2]
Low-memory GEMM-based convolution algorithms for deep neural networks
Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. Low-memory GEMM-based convolu- tion algorithms for deep neural networks. arXiv preprint arXiv:1709.03395, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[3]
High per- formance convolutional neural networks for document pro- cessing
Kumar Chellapilla, Sidd Puri, and Patrice Simard. High per- formance convolutional neural networks for document pro- cessing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006
work page 2006
-
[4]
Encoder-decoder with atrous separable convolution for semantic image segmentation
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 801–818, 2018
work page 2018
-
[5]
{TVM}: An automated end-to-end optimizing compiler for deep learning
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. {TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18), pages 578–594, 2018
work page 2018
-
[6]
cuDNN: Efficient Primitives for Deep Learning
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[7]
Mec: memory-efficient con- volution for deep neural network
Minsik Cho and Daniel Brand. Mec: memory-efficient con- volution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 815–824. JMLR. org, 2017
work page 2017
-
[8]
Xception: Deep learning with depthwise separable convolutions
Franc ¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 1251–1258, 2017
work page 2017
-
[9]
Language modeling with gated convolutional net- works
Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional net- works. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 933–941. JMLR. org, 2017
work page 2017
-
[10]
QNNPACK: open source library for optimized mobile deep learn- ing
Marat Dukhan, Yiming Wu, and Hao Lu. QNNPACK: open source library for optimized mobile deep learn- ing. https://code.fb.com/ml-applications/ qnnpack/. [Online; accessed 8-April-2019]
work page 2019
-
[11]
Anatomy of high-performance deep learning con- volutions on simd architectures
Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning con- volutions on simd architectures. InSC18: International Con- ference for High Performance Computing, Networking, Stor- age and Analysis, pages 830–841. IEEE, 2018
work page 2018
-
[12]
Anatomy of high- performance matrix multiplication
Kazushige Goto and Robert A Geijn. Anatomy of high- performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12, 2008
work page 2008
-
[13]
Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017
work page 2017
-
[14]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[15]
LIBXSMM: accelerating small matrix multi- plications by runtime code generation
Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: accelerating small matrix multi- plications by runtime code generation. In Proceedings of the International Conference for High Performance Comput- ing, Networking, Storage and Analysis, page 84. IEEE Press, 2016
work page 2016
-
[16]
Squeeze-and-excitation net- works
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net- works. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018
work page 2018
-
[17]
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer pa- rameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[18]
Caffe: Convolutional architecture for fast feature embedding
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM inter- national conference on Multimedia , pages 675–678. ACM, 2014
work page 2014
-
[19]
Pooling Pyramid Network for Object Detection
Pengchong Jin, Vivek Rathod, and Xiangxin Zhu. Pool- ing pyramid network for object detection. arXiv preprint arXiv:1807.03284, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[20]
Panoptic Feature Pyramid Networks
Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Doll´ar. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[21]
Fast algorithms for convo- lutional neural networks
Andrew Lavin and Scott Gray. Fast algorithms for convo- lutional neural networks. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016
work page 2016
-
[22]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Pro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017
work page 2017
-
[23]
Shufflenet v2: Practical guidelines for efficient CNN archi- tecture design
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient CNN archi- tecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 9
work page 2018
-
[24]
Automatic differentiation in PyTorch
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al- ban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017
work page 2017
-
[25]
YOLOv3: An Incremental Improvement
Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[26]
Mobilenetv2: Inverted residuals and linear bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 4510–4520, 2018
work page 2018
-
[27]
Wavenet: A generative model for raw audio
A ¨aron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016
work page 2016
-
[28]
Blis: A frame- work for rapidly instantiating BLAS functionality
Field G Van Zee and Robert A Van De Geijn. Blis: A frame- work for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS) , 41(3):14, 2015
work page 2015
-
[29]
Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolu- tional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[30]
Samuel Williams, Andrew Waterman, and David Patter- son. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Techni- cal report, Lawrence Berkeley National Lab.(LBNL), Berke- ley, CA (United States), 2009
work page 2009
-
[31]
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[32]
Pay Less Attention with Lightweight and Dynamic Convolutions
Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dy- namic convolutions. arXiv preprint arXiv:1901.10430, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[33]
Xiaomi. MACE. https://github.com/XiaoMi/ mace. [Online; accessed 8-April-2019]
work page 2019
-
[34]
Aggregated residual transformations for deep neural networks
Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017
work page 2017
-
[35]
High Performance Zero-Memory Overhead Direct Convolutions
Jiyuan Zhang, Franz Franchetti, and Tze Meng Low. High performance zero-memory overhead direct convolutions. arXiv preprint arXiv:1809.10170, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[36]
Learning transferable architectures for scalable image recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018. 10
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.