pith. machine review for the scientific record.

arxiv: 2605.00174 · v1 · submitted 2026-04-30 · 💻 cs.AR · cs.CV

Recognition: unknown

DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:47 UTC · model grok-4.3

classification 💻 cs.AR cs.CV
keywords CNN inference · DPU GPU split · asynchronous pipelining · graph neural network · latency reduction · model partitioning · edge acceleration · ResNet VGG

The pith

Splitting CNN layers between DPU and GPU with asynchronous pipelining reduces inference latency by up to 3.37 times compared to GPU-only runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that early CNN layers can run on a DPU near the input source while later layers run on a GPU, with the two devices executing asynchronously to overlap computation and reduce data movement. A graph neural network predicts the best layer to split at, achieving 96.27 percent accuracy across tested models without manual tuning. Experiments on networks such as LeNet-5, ResNet variants, VGG16, and MobileNetv2 show concrete gains of up to 2.48 times versus DPU alone. This hybrid setup targets low-latency requirements for video and image streaming on edge devices where single-accelerator approaches fall short. The automation via the predictor makes the partitioning practical for developers working with established CNN architectures.
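
To make the split concrete, the sketch below partitions a CNN at a layer index in PyTorch. It is illustrative only: the paper compiles its first partition for the VCK190's AI Engines through the DPU toolchain, whereas here both halves stay ordinary host-side modules, and the split_index value is hypothetical.

```python
# Illustrative layer-level split of a CNN into a "DPU" head and a "GPU"
# tail. The real system compiles the head for the VCK190 AI Engines;
# here both halves are ordinary modules and split_index is hypothetical.
import torch
import torch.nn as nn
from torchvision.models import vgg16

def split_model(model, split_index: int):
    """Flatten VGG16 into a layer list and cut it at split_index."""
    layers = (list(model.features)
              + [model.avgpool, nn.Flatten()]
              + list(model.classifier))
    head = nn.Sequential(*layers[:split_index])   # early layers (DPU side)
    tail = nn.Sequential(*layers[split_index:])   # remaining layers (GPU side)
    return head, tail

model = vgg16(weights=None).eval()
head, tail = split_model(model, split_index=10)   # hypothetical cut point

x = torch.randn(1, 3, 224, 224)                   # one input image
with torch.no_grad():
    activation = head(x)        # this tensor is what crosses devices
    logits = tail(activation)
print(activation.shape, logits.shape)
```

The shape of `activation` is exactly the data that must move between the two devices, which is why the choice of cut point governs the transfer cost the paper measures.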

Core claim

Partitioning CNN inference across the AI engines of a Versal VCK190 DPU for the first layers and an NVIDIA RTX 2080 GPU for the remaining layers, executed in an asynchronous pipeline, yields lower end-to-end latency than running the full network on either device alone. The approach further includes a trained graph neural network that selects the partition index for any given model, reaching 96.27 percent accuracy on the evaluated networks including LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2.

What carries the argument

Split CNN Inference, the asynchronous pipeline that assigns initial layers to the DPU near the data source and remaining layers to the GPU while a graph neural network predicts the split point.
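
The following sketch shows the overlap structure of such an asynchronous pipeline. run_on_dpu and run_on_gpu are hypothetical stand-ins for the two device runtimes; only the producer-consumer overlap is the point, not the timings.

```python
# Overlap sketch: while the "GPU" stage consumes frame i, the "DPU"
# stage is already producing the activation for frame i+1.
# run_on_dpu / run_on_gpu are placeholders for the real device runtimes.
import queue
import threading
import time

def run_on_dpu(frame):            # stand-in: early layers near the source
    time.sleep(0.010)             # pretend 10 ms of DPU work
    return f"act({frame})"

def run_on_gpu(activation):       # stand-in: remaining layers on the GPU
    time.sleep(0.012)             # pretend 12 ms of GPU work
    return f"pred({activation})"

def pipeline(frames):
    q = queue.Queue(maxsize=4)    # bounded buffer between the stages
    results = []

    def dpu_worker():
        for f in frames:
            q.put(run_on_dpu(f))
        q.put(None)               # sentinel: stream finished

    def gpu_worker():
        while (act := q.get()) is not None:
            results.append(run_on_gpu(act))

    workers = [threading.Thread(target=dpu_worker),
               threading.Thread(target=gpu_worker)]
    for w in workers: w.start()
    for w in workers: w.join()
    return results

t0 = time.perf_counter()
out = pipeline(range(20))
# Pipelined: ~20 * max(10, 12) ms; serial would be ~20 * (10 + 12) ms.
print(len(out), "frames in", f"{time.perf_counter() - t0:.3f}s")
```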

Load-bearing premise

The reported latency gains hold for the specific Versal VCK190 plus RTX 2080 hardware pair and the listed CNN models without requiring retraining of the partition predictor for new networks or input sizes.

What would settle it

Measuring higher end-to-end latency on the proposed split pipeline than on the faster of the two single-device baselines for ResNet50 or VGG16 on the same Versal VCK190 and RTX 2080 hardware.

Figures

Figures reproduced from arXiv: 2605.00174 by Ali Emre Oztas, James Garside, Mahir Demir, Mikel Luján.

Figure 1. Example of an edge system for image processing with DPU and GPU.
Figure 2. Example of optimal partitioning of LeNet-5 inference.
Figure 3. Illustration of increasing parallelism and reducing latency with Split CNN inference.
Figure 4. Illustration of proposed GNN models to (a) predict model inference latencies or (b) partition index.
Figure 5. Exploration of data transfer latency between DPU and host DRAM across partitions.
Figure 6. Exploration of data size transferred between DPU and host DRAM across partitions.
Figure 7. Latency of VGG16 inference across different partitions.
Figure 8. Ground truth and predicted latency results for a 42-layer CNN model from the test set.
Figure 9. Ground truth and predicted latency results for the VGG16 model.
original abstract

Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, further reductions in latency can be observed by combining these units. In this paper, partitioning CNN inference across DPU and GPU (Split CNN Inference) is proposed. The first partition runs on the AI engines (DPU) of a Versal VCK190, which consists of initial CNN layers processing the input images. The DPU processes the first partition near the source of the data. Pipelined asynchronously, a GPU runs the remaining layers. The GPU (NVIDIA RTX 2080) processes the second partition, albeit having reduced the data transfer between the data source (storage/camera) and the GPU. Furthermore, a Graph Neural Network (GNN)-based partition index prediction method is proposed to automate the partitioning of CNNs needed for the Split Inference. Well established models such as LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2 are analyzed. Results demonstrate up to 2.48x latency improvement over DPU-only execution and up to 3.37x over GPU-only execution. The trained GNN model splits the layers between the appropriate devices with 96.27% accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes partitioning CNN inference across a DPU on the Versal VCK190 (initial layers on the AI Engines) and an RTX 2080 GPU (remaining layers), executed asynchronously in a pipeline to reduce data movement. A Graph Neural Network is trained to predict the optimal layer index at which to split a given CNN. Experiments on LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2 report up to 2.48× latency improvement versus DPU-only execution and up to 3.37× versus GPU-only execution, together with 96.27% accuracy for the GNN predictor.

Significance. If the measured speedups are reproducible and the GNN predictor generalizes, the hybrid split approach could provide a practical route to lower-latency edge inference by exploiting complementary strengths of DPU and GPU. The automation component is a constructive contribution, but the current evidence is confined to one hardware pair and a narrow set of models, so the broader impact remains conditional on further validation.

major comments (3)
  1. [Abstract and Results] The headline latency claims (2.48× over DPU-only, 3.37× over GPU-only) rest on single-point end-to-end timing measurements without reported error bars, number of runs, warm-up protocol, or ablation of PCIe transfer and synchronization overheads; these omissions are load-bearing for the central performance assertion (a measurement-protocol sketch follows the minor comments).
  2. [GNN-based Partition Prediction] The reported 96.27% accuracy is given without description of the training corpus (which models and layer features were used), validation split, or accuracy on architectures not seen during training, undermining the claim that the method automates partitioning reliably.
  3. [Evaluation Setup] No cross-device, cross-resolution, or cross-architecture experiments are presented; the optimal cut point depends on relative layer throughput, activation sizes, and interconnect bandwidth, all of which are hardware- and model-specific, so the reported factors and automation claim cannot be assumed to hold beyond the Versal VCK190 + RTX 2080 pair.
minor comments (2)
  1. Figure captions and text should explicitly state the input resolution and batch size used for all timing measurements.
  2. The manuscript would benefit from a concise table summarizing per-model split indices chosen by the GNN and the corresponding measured latencies.
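
For reference, a minimal sketch of the measurement protocol major comment 1 asks for: discard warm-up runs, then report mean and standard deviation over repeated end-to-end runs. The default counts mirror what the simulated rebuttal promises below (10 warm-up, 100 measured); run_inference is a hypothetical stand-in for one pipelined inference.

```python
# Sketch of the requested protocol: discard warm-up runs, then report
# mean ± standard deviation over repeated end-to-end timings.
# run_inference is a hypothetical stand-in for one pipelined inference.
import statistics
import time

def measure(run_inference, warmup: int = 10, runs: int = 100):
    for _ in range(warmup):                # stabilize clocks and caches
        run_inference()
    samples_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)

mean_ms, std_ms = measure(lambda: time.sleep(0.005))  # dummy 5 ms workload
print(f"{mean_ms:.2f} ± {std_ms:.2f} ms over 100 runs")
```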

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional details, statistical reporting, and clarifications on scope and limitations.

point-by-point responses
  1. Referee: [Abstract and Results] The headline latency claims (2.48× over DPU-only, 3.37× over GPU-only) rest on single-point end-to-end timing measurements without reported error bars, number of runs, warm-up protocol, or ablation of PCIe transfer and synchronization overheads; these omissions are load-bearing for the central performance assertion.

    Authors: We agree that the performance results require statistical rigor and overhead analysis to be fully convincing. In the revised manuscript we will report results from 100 runs per configuration after a 10-inference warm-up, include mean latency with standard deviation and error bars in all figures and tables, and add an explicit ablation isolating PCIe transfer time and synchronization overhead from the end-to-end latency. These additions directly address the load-bearing omissions noted. revision: yes

  2. Referee: [GNN-based Partition Prediction] The reported 96.27% accuracy is given without description of the training corpus (which models and layer features were used), validation split, or accuracy on architectures not seen during training, undermining the claim that the method automates partitioning reliably.

    Authors: We acknowledge that the GNN section lacked necessary methodological detail. The revised version will expand this section to describe the training corpus (layer features including type, dimensions, FLOPs and activation sizes extracted from LeNet-5, all ResNet variants, VGG16 and MobileNetv2), the 80/20 train/validation split, training hyperparameters, and accuracy measured on held-out model architectures not present in the training set (a sketch of such per-layer features follows these responses). This will substantiate the automation claim. revision: yes

  3. Referee: [Evaluation Setup] No cross-device, cross-resolution, or cross-architecture experiments are presented; the optimal cut point depends on relative layer throughput, activation sizes, and interconnect bandwidth, all of which are hardware- and model-specific, so the reported factors and automation claim cannot be assumed to hold beyond the Versal VCK190 + RTX 2080 pair.

    Authors: We agree that the optimal split point is hardware- and model-dependent and that broader validation would strengthen the work. The revision will add a dedicated Limitations subsection that explicitly states the results apply to the Versal VCK190 + RTX 2080 pair, discusses how layer throughput, activation sizes and interconnect bandwidth affect the cut point, and notes that the GNN predictor can be retrained for new platforms. We cannot perform additional cross-device or cross-architecture experiments in this revision due to hardware availability constraints. revision: partial
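
Response 2 names the per-layer features promised for the GNN training corpus: layer type, dimensions, FLOPs, and activation sizes. The sketch below extracts features of that kind with PyTorch forward hooks; the encoding and the FLOP estimates are assumptions for illustration, not the authors' pipeline.

```python
# Sketch of per-layer feature extraction for a GNN training corpus.
# The simulated rebuttal lists layer type, dimensions, FLOPs, and
# activation sizes as features; the exact encoding here is assumed.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def layer_features(model: nn.Module, input_shape=(1, 3, 224, 224)):
    feats, hooks = [], []

    def hook(mod, _inp, out):
        act = out if isinstance(out, torch.Tensor) else out[0]
        flops = 0
        if isinstance(mod, nn.Conv2d):   # 2 * MACs per output element
            k = mod.kernel_size[0] * mod.kernel_size[1]
            flops = (2 * k * (mod.in_channels // mod.groups)
                     * mod.out_channels * act.shape[-2] * act.shape[-1])
        elif isinstance(mod, nn.Linear):
            flops = 2 * mod.in_features * mod.out_features
        feats.append({
            "type": type(mod).__name__,          # node label
            "out_shape": tuple(act.shape),       # layer dimensions
            "activation_bytes": act.numel() * act.element_size(),
            "flops": flops,
        })

    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear, nn.MaxPool2d)):
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(torch.randn(*input_shape))
    for h in hooks:
        h.remove()
    return feats  # one node per layer; edges would follow execution order

for f in layer_features(resnet18(weights=None).eval())[:3]:
    print(f)
```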

standing simulated objections (1 unresolved)
  • Performing additional cross-device, cross-resolution and cross-architecture experiments on hardware platforms other than the Versal VCK190 + RTX 2080 pair

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical timings and standard GNN training.

full rationale

The paper's core results (latency speedups and GNN accuracy) derive from end-to-end hardware measurements on the Versal VCK190 + RTX 2080 pair for the listed CNNs, with the GNN trained to predict partition indices and its 96.27% accuracy evaluated on that data. No equation or claim reduces a 'prediction' to a fitted input by construction, no self-citation chain supports a uniqueness theorem or ansatz, and the derivation chain is self-contained via experimental validation rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical timing rather than axioms or derivations; the GNN predictor is trained on the same models used for evaluation.

axioms (1)
  • domain assumption: CNN layers can be partitioned and executed independently on separate accelerators with only data-transfer overhead
    Implicit in the split-inference proposal; required for the latency benefit to materialize.
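
A toy latency model makes the axiom's role explicit: if the assumption holds, steady-state per-frame latency of the pipeline is set by its slowest stage (DPU partition, activation transfer, or GPU partition), and the best split minimizes that bottleneck. All timings below are invented for illustration.

```python
# Toy latency model under the ledger's axiom: per-frame pipelined cost
# is the bottleneck stage, so the split index is chosen to minimize it.
# All per-layer timings are invented; real values are hardware-specific.
dpu_ms  = [1.2, 0.9, 1.5, 2.0, 1.1, 0.8]   # per-layer latency on the DPU
gpu_ms  = [0.4, 0.5, 0.9, 1.4, 0.7, 0.3]   # per-layer latency on the GPU
xfer_ms = [0.6, 0.5, 1.1, 0.9, 0.4, 0.2]   # activation transfer after layer i

def pipelined(split: int) -> float:
    """Steady-state per-frame cost when layers [0, split) run on the DPU."""
    return max(sum(dpu_ms[:split]), xfer_ms[split - 1], sum(gpu_ms[split:]))

best = min(range(1, len(dpu_ms)), key=pipelined)
print("best split:", best, "-> bottleneck:", round(pipelined(best), 2), "ms")
print("DPU-only:", sum(dpu_ms), "ms; GPU-only:", sum(gpu_ms), "ms")
```

On these invented numbers the bottleneck at the best split is lower than either single-device total, which is the shape of the paper's claim; on real hardware the per-layer and transfer terms would come from measurement.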

pith-pipeline@v0.9.0 · 5587 in / 1223 out tokens · 43297 ms · 2026-05-09T19:47:16.947155+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 11, 2278–2324

  2. [2]

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255

  3. [3]

    Olga Russakovsky et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115, 3, 211–252

  4. [4]

    Bipin Gaikwad and Abhijit Karmakar. 2021. Smart surveillance system for real-time multi-person multi-camera tracking at the edge. Journal of real-time image processing, 18, 6, 1993–2007

  5. [5]

    Daniele Berardini, Lucia Migliorelli, Alessandro Galdelli, Emanuele Frontoni, Adriano Mancini, and Sara Moccia. 2024. A deep-learning framework running on edge devices for handgun and knife detection from indoor video-surveillance cameras. Multimedia Tools and Applications, 83, 7, 19109–19127

  6. [6]

    Daniele Berardini, Adriano Mancini, Primo Zingaretti, and Sara Moccia. 2021. Edge artificial intelligence: a multi-camera video surveillance application. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. Vol. 85437. American Society of Mechanical Engineers, V007T07A006

  7. [7]

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1874–1883

  8. [8]

    Norman P Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture, 1–12

  9. [9]

    Google Cloud Blog. 2025. Introducing trillium, sixth-generation tpus. Blog post. Google. (Mar. 25, 2025). Retrieved Mar. 25, 2025 from https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus

  10. [10]

    Gupta Alok. 2020. Architecture apocalypse dream architecture for deep learning inference and compute - Versal AI Core. Embedded World, 10

  11. [11]

    Jude Haris, Perry Gibson, José Cano, Nicolas Bohm Agostini, and David Kaeli.

  12. [12]

    Secda-tflite: a toolkit for efficient development of fpga-based dnn accelerators for edge inference. Journal of Parallel and Distributed Computing, 173, 140–151

  13. [13]

    Ehsan Aghapour, Dolly Sapra, Andy Pimentel, and Anuj Pathania. 2022. Cpu-gpu layer-switched low latency cnn inference. In 2022 25th Euromicro Conference on Digital System Design (DSD). IEEE, 324–331

  14. [14]

    Fabian Kreß, Tim Hotfilter, Julian Hoefer, Tanja Harbaum, Juergen Becker, et al.

  15. [15]

    Automated deep neural network inference partitioning for distributed embedded systems. In 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 39–44

  16. [16]

    Jiyoung An, Esmerald Aliaj, and Sang-Woo Jun. 2023. Precog: near-storage accelerator for heterogeneous cnn inference. In 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 45–52

  17. [17]

    Yongjoo Jang, Sejin Kim, Daehoon Kim, Sungjin Lee, and Jaeha Kung. 2021. Deep partitioned training from near-storage computing to dnn accelerators. IEEE Computer Architecture Letters, 20, 1, 70–73

  18. [18]

    Hongsun Jang, Jaeyong Song, Jaewon Jung, Jaeyoung Park, Youngsok Kim, and Jinho Lee. 2024. Smart-infinity: fast large language model training using near-storage processing on a real system. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 345–360

  19. [19]

    Ao Zhou, Jianlei Yang, Tong Qiao, Yingjie Qi, Zhi Yang, Weisheng Zhao, and Chunming Hu. 2024. Graph neural networks automated design and deployment on device-edge co-inference systems. In Proceedings of the 61st ACM/IEEE Design Automation Conference, 1–6

  20. [20]

    Walther Carballo-Hernández, Maxime Pelcat, and François Berry. 2023. Automatic cnn model partitioning for gpu/fpga-based embedded heterogeneous accelerators using geometric programming. Journal of Signal Processing Systems, 95, 10, 1203–1218

  21. [21]

    Yuji Chai, Devashree Tripathy, Chuteng Zhou, Dibakar Gope, Igor Fedorov, Ramon Matas, David Brooks, Gu-Yeon Wei, and Paul Whatmough. 2023. Perfsage: generalized inference performance predictor for arbitrary deep learning models on edge devices. arXiv preprint arXiv:2301.10999

  22. [22]

    Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. 2020. Autogluon-tabular: robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505

  23. [23]

    Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9, 8, 1735–1780

  24. [24]

    AMD. [n. d.] Dpucvdx8g product guide (pg389). AMD. Retrieved Aug. 31, 2025 from https://docs.amd.com/r/en-US/pg389-dpucvdx8g

  25. [25]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778

  26. [26]

    Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  27. [27]

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4510–4520

  28. [28]

    MLCommons. [n. d.] Mlperf inference: edge benchmark suite. MLCommons. Retrieved Aug. 31, 2025 from https://mlcommons.org/benchmarks/inference-edge/

  29. [29]

    PyTorch. [n. d.] Model zoo — pytorch/serve master documentation. PyTorch. Retrieved Mar. 31, 2025 from https://docs.pytorch.org/serve/model_zoo.html

  30. [30]

    Keras Team. [n. d.] Keras documentation: keras applications. Keras. Retrieved Mar. 31, 2025 from https://keras.io/api/applications/

  31. [31]

    Xilinx. [n. d.] Model zoo table — vitis ai 3.5 reference documentation. Xilinx. Retrieved Mar. 31, 2025 from https://xilinx.github.io/Vitis-AI/3.5/html/docs/reference/-/ModelZoo_Github_web.htm