pith. sign in

arxiv: 1907.02129 · v1 · pith:OWC2RKXKnew · submitted 2019-07-03 · 💻 cs.CV · cs.LG· cs.NE

The Indirect Convolution Algorithm

Pith reviewed 2026-05-25 10:00 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.NE
keywords convolutionGEMMim2coldeep learningcomputer visionindirection buffermemory optimizationperformance
0
0 comments X

The pith

The Indirect Convolution algorithm performs GEMM-based convolutions using an indirection buffer of pointers instead of an im2col data transformation, reducing memory overhead in proportion to input channels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning frameworks commonly convert convolutions into matrix multiplications via GEMM calls from BLAS libraries, but kernels larger than 1x1 require an im2col step that copies and rearranges image data into a temporary matrix. The paper presents the Indirect Convolution method, which avoids this copy by building a compact buffer of pointers to the original image rows and adapting the GEMM routine to read through those pointers. This change preserves the performance of highly tuned matrix multiplication while eliminating the extra memory traffic and storage. A reader would care because memory bandwidth limits many convolution workloads in practice, and the savings grow directly with the number of input channels.

Core claim

The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of the modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col in GEM

What carries the argument

Indirection buffer of pointers to image pixel rows, which lets a modified GEMM routine compute convolutions of any kernel size without first copying data into a packed matrix.

If this is right

  • Memory overhead drops in direct proportion to the number of input channels.
  • Performance improves by as much as 62 percent versus GEMM-plus-im2col on convolutions that require the transformation.
  • The same GEMM routine now works for arbitrary kernel sizes, padding, stride, and dilation without extra layout code.
  • A small slowdown appears only on 1x1 stride-1 convolutions that never needed im2col.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may deliver larger relative gains on memory-bandwidth-limited hardware such as mobile CPUs or embedded processors.
  • Framework implementers could adopt it selectively for kernel sizes greater than 1x1 while retaining direct GEMM for 1x1 cases.
  • The pointer-based access pattern could interact with cache prefetchers in ways that vary across CPU microarchitectures.
  • Combining the indirection buffer with existing tiling or vectorization passes inside BLAS libraries remains an open implementation question.

Load-bearing premise

The overhead of pointer chasing in the indirection buffer stays smaller than the cost of the im2col memory copy and layout change on the target hardware.

What would settle it

A direct timing and memory measurement on the same CPU that shows the indirection version using equal or greater total time and memory than standard im2col for kernels larger than 1x1.

Figures

Figures reproduced from arXiv: 1907.02129 by Marat Dukhan.

Figure 1
Figure 1. Figure 1: GEMM operation as a component of GEMM-based con [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance of the Indirect Convolution algorithm and GEMM-based Algorithm on convolution operators of the ResNet-18 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance of the Indirect Convolution algorithm and GEMM-based Algorithm on convolution operators of the SqueezeNet 1.0 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS libraries. Convolutions with 1x1 kernels can be directly represented as a GEMM call, but convolutions with larger kernels require a special memory layout transformation - im2col or im2row - to fit into GEMM interface. The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of our modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col transformations in GEMM-based algorithms. This, however, comes at cost of minor performance reduction on 1x1 stride-1 convolutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Indirect Convolution algorithm as an alternative to standard GEMM-based convolution implementations in deep learning frameworks. Instead of using im2col/im2row to reshape input data for GEMM, it introduces an indirection buffer (array of pointers to image rows) that allows a modified GEMM routine to handle convolutions with arbitrary kernel size, padding, stride, and dilation. The central claims are that this reduces memory overhead proportionally to the number of input channels and yields up to 62% speedup versus GEMM-based methods on convolutions that require im2col, at the cost of minor slowdown on 1x1 stride-1 cases.

Significance. If the performance and memory claims hold after accounting for indirection overhead, the work would provide a practical way to extend highly optimized BLAS GEMM primitives to a wider range of convolution parameters without data copying, which could benefit memory-constrained inference and training in DL frameworks. The approach is a direct, implementable alternative rather than a wholly new primitive.

major comments (2)
  1. [Abstract] Abstract: The performance claim of 'outperforms the GEMM-based algorithm by up to 62%' is presented without any reference to hardware platform, specific convolution parameters (kernel size, channels, batch), baseline implementation, or measured metrics; this is load-bearing because the central contribution is the claimed net speedup after introducing the indirection buffer.
  2. [Abstract / Algorithm] Algorithm description (implied in abstract and methods): No micro-benchmark or analysis isolates the overhead of pointer chasing and non-contiguous memory accesses in the modified GEMM versus a standard contiguous GEMM; this directly tests the weakest assumption that indirection does not offset the avoided im2col copy cost.
minor comments (1)
  1. [Abstract] The abstract states the memory reduction is 'proportionally to the number of input channels' but does not define the exact proportionality or provide the formula relating buffer size to C_in.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on experimental conditions and analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The performance claim of 'outperforms the GEMM-based algorithm by up to 62%' is presented without any reference to hardware platform, specific convolution parameters (kernel size, channels, batch), baseline implementation, or measured metrics; this is load-bearing because the central contribution is the claimed net speedup after introducing the indirection buffer.

    Authors: We agree that the abstract would benefit from greater specificity. The 62% figure was measured on Intel Xeon processors for convolutions with kernel sizes >1x1 (e.g., 3x3), input channels 64-512, batch size 1, and stride-1, using a standard im2col+GEMM baseline from a common DL framework. Full parameters and metrics appear in the experimental section. We will revise the abstract to reference the hardware platform and the class of convolutions requiring im2col. revision: yes

  2. Referee: [Abstract / Algorithm] Algorithm description (implied in abstract and methods): No micro-benchmark or analysis isolates the overhead of pointer chasing and non-contiguous memory accesses in the modified GEMM versus a standard contiguous GEMM; this directly tests the weakest assumption that indirection does not offset the avoided im2col copy cost.

    Authors: This observation is fair. While aggregate results show net gains from avoiding im2col, an isolated comparison would better quantify the indirection cost. We will add micro-benchmark results in the revised manuscript comparing the modified GEMM (with indirection buffer) to a standard contiguous GEMM on identical matrix sizes, reporting the overhead percentage attributable to pointer chasing and non-contiguous loads. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithm presented as direct alternative without self-referential derivations

full rationale

The paper describes the Indirect Convolution algorithm as an alternative to GEMM+im2col methods by introducing an indirection buffer of pointers. No equations, predictions, fitted parameters, or uniqueness theorems are claimed. No self-citations appear in the abstract or description. The performance claims are empirical comparisons, not reductions to inputs by construction. This matches the default expectation of a non-circular algorithmic paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the invention of the indirection buffer and the domain assumption about GEMM availability and efficiency.

axioms (1)
  • domain assumption Existence of highly optimized BLAS GEMM libraries that can be modified for indirection.
    The algorithm builds on top of GEMM primitives provided by BLAS libraries.
invented entities (1)
  • indirection buffer no independent evidence
    purpose: Buffer of pointers to the start of each row of image pixels to enable indirect access in GEMM.
    New concept introduced to enable the algorithm without data reshuffling.

pith-pipeline@v0.9.0 · 5738 in / 1271 out tokens · 48866 ms · 2026-05-25T10:00:14.291283+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 10 internal anchors

  1. [1]

    Tensorflow: A system for large-scale machine learning

    Mart ´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe- mawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementa- tion ({OSDI} 16), pages 265–283, 2016

  2. [2]

    Low-memory GEMM-based convolution algorithms for deep neural networks

    Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. Low-memory GEMM-based convolu- tion algorithms for deep neural networks. arXiv preprint arXiv:1709.03395, 2017

  3. [3]

    High per- formance convolutional neural networks for document pro- cessing

    Kumar Chellapilla, Sidd Puri, and Patrice Simard. High per- formance convolutional neural networks for document pro- cessing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006

  4. [4]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 801–818, 2018

  5. [5]

    {TVM}: An automated end-to-end optimizing compiler for deep learning

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. {TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18), pages 578–594, 2018

  6. [6]

    cuDNN: Efficient Primitives for Deep Learning

    Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014

  7. [7]

    Mec: memory-efficient con- volution for deep neural network

    Minsik Cho and Daniel Brand. Mec: memory-efficient con- volution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 815–824. JMLR. org, 2017

  8. [8]

    Xception: Deep learning with depthwise separable convolutions

    Franc ¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 1251–1258, 2017

  9. [9]

    Language modeling with gated convolutional net- works

    Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional net- works. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 933–941. JMLR. org, 2017

  10. [10]

    QNNPACK: open source library for optimized mobile deep learn- ing

    Marat Dukhan, Yiming Wu, and Hao Lu. QNNPACK: open source library for optimized mobile deep learn- ing. https://code.fb.com/ml-applications/ qnnpack/. [Online; accessed 8-April-2019]

  11. [11]

    Anatomy of high-performance deep learning con- volutions on simd architectures

    Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning con- volutions on simd architectures. InSC18: International Con- ference for High Performance Computing, Networking, Stor- age and Analysis, pages 830–841. IEEE, 2018

  12. [12]

    Anatomy of high- performance matrix multiplication

    Kazushige Goto and Robert A Geijn. Anatomy of high- performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12, 2008

  13. [13]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

  14. [14]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  15. [15]

    LIBXSMM: accelerating small matrix multi- plications by runtime code generation

    Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. LIBXSMM: accelerating small matrix multi- plications by runtime code generation. In Proceedings of the International Conference for High Performance Comput- ing, Networking, Storage and Analysis, page 84. IEEE Press, 2016

  16. [16]

    Squeeze-and-excitation net- works

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net- works. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

  17. [17]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer pa- rameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016

  18. [18]

    Caffe: Convolutional architecture for fast feature embedding

    Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM inter- national conference on Multimedia , pages 675–678. ACM, 2014

  19. [19]

    Pooling Pyramid Network for Object Detection

    Pengchong Jin, Vivek Rathod, and Xiangxin Zhu. Pool- ing pyramid network for object detection. arXiv preprint arXiv:1807.03284, 2018

  20. [20]

    Panoptic Feature Pyramid Networks

    Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Doll´ar. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019

  21. [21]

    Fast algorithms for convo- lutional neural networks

    Andrew Lavin and Scott Gray. Fast algorithms for convo- lutional neural networks. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016

  22. [22]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Pro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

  23. [23]

    Shufflenet v2: Practical guidelines for efficient CNN archi- tecture design

    Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient CNN archi- tecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 9

  24. [24]

    Automatic differentiation in PyTorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al- ban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017

  25. [25]

    YOLOv3: An Incremental Improvement

    Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

  26. [26]

    Mobilenetv2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 4510–4520, 2018

  27. [27]

    Wavenet: A generative model for raw audio

    A ¨aron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016

  28. [28]

    Blis: A frame- work for rapidly instantiating BLAS functionality

    Field G Van Zee and Robert A Van De Geijn. Blis: A frame- work for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS) , 41(3):14, 2015

  29. [29]

    Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

    Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolu- tional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014

  30. [30]

    Roofline: An insightful visual performance model for floating-point programs and multicore architectures

    Samuel Williams, Andrew Waterman, and David Patter- son. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Techni- cal report, Lawrence Berkeley National Lab.(LBNL), Berke- ley, CA (United States), 2009

  31. [31]

    FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search

    Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018

  32. [32]

    Pay Less Attention with Lightweight and Dynamic Convolutions

    Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dy- namic convolutions. arXiv preprint arXiv:1901.10430, 2019

  33. [33]

    Xiaomi. MACE. https://github.com/XiaoMi/ mace. [Online; accessed 8-April-2019]

  34. [34]

    Aggregated residual transformations for deep neural networks

    Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017

  35. [35]

    High Performance Zero-Memory Overhead Direct Convolutions

    Jiyuan Zhang, Franz Franchetti, and Tze Meng Low. High performance zero-memory overhead direct convolutions. arXiv preprint arXiv:1809.10170, 2018

  36. [36]

    Learning transferable architectures for scalable image recognition

    Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018. 10