pith. machine review for the scientific record.

arxiv: 2604.22228 · v2 · submitted 2026-04-24 · 💻 cs.DC

Recognition: unknown

Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:57 UTC · model grok-4.3

classification 💻 cs.DC
keywords CUDA Graphs · UCX · GPU-to-GPU communication · multi-path transfers · intra-node · MPI · bandwidth · HPC

The pith

Integrating CUDA Graphs into UCX enables up to 2.95x faster multi-path GPU-to-GPU communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that CUDA Graphs can be embedded in the UCX library to orchestrate simultaneous transfers across several intra-node GPU paths, including NVLink and host PCIe links. This multi-path strategy cuts communication overhead in point-to-point GPU exchanges for high-performance computing workloads. A sympathetic reader would care because many modern servers pack multiple GPUs, and data movement between them often limits overall application speed. Experiments on a four-GPU node confirm the gains in standard bandwidth benchmarks.

Core claim

By integrating CUDA Graphs into UCX, the approach captures and replays optimized multi-path communication patterns, allowing concurrent use of the host path and two GPU paths to achieve up to 2.95 times the bandwidth of single-path UCX CUDA-IPC transfers for messages up to 512 MB in OMB tests.

What carries the argument

The CUDA Graph integration in UCX for concurrent multi-path transfers, which optimizes workflows by leveraging NVLink and PCIe paths simultaneously.
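
To make the capture-and-replay mechanism concrete, here is a minimal CUDA sketch of the general technique: two streams issue halves of a message over a direct peer-to-peer path and a host-staged path, the whole fork/join pattern is captured into a graph, and one graph launch replays the multi-path transfer. This illustrates the idea only, not the authors' UCX integration; the 50/50 split, the single staging copy, and names like s_main, s_host, and host_stage are assumptions made for the sketch.

    #include <cuda_runtime.h>
    #include <cstdio>

    #define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
        return 1; } } while (0)

    int main() {
        const size_t bytes = 64u << 20;  // 64 MB message; the paper tests up to 512 MB
        const size_t half  = bytes / 2;  // naive 50/50 split across the two paths (assumption)

        char *src = nullptr, *dst = nullptr, *host_stage = nullptr;
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));  // direct P2P path (NVLink or PCIe), if supported
        CHECK(cudaMalloc(&src, bytes));
        CHECK(cudaHostAlloc(&host_stage, half, cudaHostAllocPortable));  // pinned staging buffer
        CHECK(cudaSetDevice(1));
        CHECK(cudaMalloc(&dst, bytes));
        CHECK(cudaSetDevice(0));

        cudaStream_t s_main, s_host;
        CHECK(cudaStreamCreate(&s_main));
        CHECK(cudaStreamCreate(&s_host));
        cudaEvent_t fork, join;
        CHECK(cudaEventCreate(&fork));
        CHECK(cudaEventCreate(&join));

        // Capture both paths into a single graph. Recording an event on the captured
        // stream and waiting on it from s_host pulls s_host into the same capture.
        CHECK(cudaStreamBeginCapture(s_main, cudaStreamCaptureModeGlobal));
        CHECK(cudaEventRecord(fork, s_main));
        CHECK(cudaStreamWaitEvent(s_host, fork, 0));
        // Path 1: first half directly GPU0 -> GPU1.
        CHECK(cudaMemcpyPeerAsync(dst, 1, src, 0, half, s_main));
        // Path 2: second half staged GPU0 -> pinned host -> GPU1 (relies on UVA).
        CHECK(cudaMemcpyAsync(host_stage, src + half, half, cudaMemcpyDeviceToHost, s_host));
        CHECK(cudaMemcpyAsync(dst + half, host_stage, half, cudaMemcpyHostToDevice, s_host));
        CHECK(cudaEventRecord(join, s_host));
        CHECK(cudaStreamWaitEvent(s_main, join, 0));  // join before ending the capture
        cudaGraph_t graph;
        CHECK(cudaStreamEndCapture(s_main, &graph));

        cudaGraphExec_t exec;
        CHECK(cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0));

        // Replay: a single launch re-issues the whole multi-path transfer,
        // amortizing per-operation launch overhead across iterations.
        CHECK(cudaGraphLaunch(exec, s_main));
        CHECK(cudaStreamSynchronize(s_main));
        printf("multi-path graph transfer of %zu bytes complete\n", bytes);
        return 0;
    }

In the paper's setting this capture would happen inside UCX around its CUDA-IPC transports rather than over raw cudaMemcpy calls, but the fork/join capture pattern is the load-bearing mechanism either way.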

If this is right

  • Bandwidth in GPU-to-GPU transfers improves significantly when multiple paths are used together under CUDA Graph control.
  • Communication overhead decreases in MPI applications running on multi-GPU nodes.
  • The first seamless integration of CUDA Graphs into UCX opens the door for similar optimizations in other frameworks.
  • Performance holds for message sizes up to 512 MB on tested hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be extended to larger GPU clusters if path diversity is available across nodes.
  • Application developers might see benefits in non-OMB workloads like deep learning training data exchanges.
  • Testing on different GPU generations would confirm if the gains generalize beyond the four-GPU node used here.

Load-bearing premise

That the overhead from CUDA Graph integration remains negligible and the multi-path configuration stays beneficial and stable on varied GPU hardware and system setups.

What would settle it

Rerunning the OMB bandwidth test on the same four-GPU node with the multi-path CUDA Graph feature enabled, and observing bandwidth no higher than the single-path UCX (UCT::CUDA-IPC) baseline, would falsify the claimed improvement.
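
A minimal harness for that check might look like the following, appended to the sketch above (so src, dst, bytes, exec, and s_main are the names assumed there, not anything from the paper's code): time many replays of the instantiated multi-path graph against the same number of single-path peer copies and compare effective bandwidth. The 1000-iteration average mirrors the methodology quoted in the figure notes below; warmup and error checking are omitted for brevity.

    // Average the per-iteration time of `iters` asynchronous operations issued by `issue`.
    template <typename F>
    static float avg_ms(cudaStream_t s, int iters, F issue) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0, s);
        for (int i = 0; i < iters; ++i) issue(s);
        cudaEventRecord(t1, s);
        cudaEventSynchronize(t1);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, t0, t1);
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        return ms / iters;
    }

    // Compare the single-path baseline against the replayed multi-path graph.
    static void compare_paths(char* src, char* dst, size_t bytes,
                              cudaGraphExec_t exec, cudaStream_t s) {
        const int iters = 1000;  // matches the 1000-run averaging described by the authors
        float single_ms = avg_ms(s, iters, [&](cudaStream_t st) {
            cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, st);  // single direct path
        });
        float multi_ms = avg_ms(s, iters, [&](cudaStream_t st) {
            cudaGraphLaunch(exec, st);                       // multi-path graph replay
        });
        double gbytes = bytes / 1e9;
        printf("single-path %.2f GB/s, multi-path graph %.2f GB/s, speedup %.2fx\n",
               gbytes / (single_ms / 1e3), gbytes / (multi_ms / 1e3),
               single_ms / multi_ms);
        // A speedup at or below 1.0x on the paper's four-GPU node would contradict the claim.
    }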

Figures

Figures reproduced from arXiv: 2604.22228 by Ahmad Afsahi, Amirhossein Sojoodi, Amirreza Baratisedeh, Hamed Sharifian, Yiltan Hassan Temucin.

Figure 1: UCX architecture and some of its components
Figure 2: (a) A typical four-GPU node with NVLink (two…
Figure 4: A simplified view of 2-D pipelined communication
Figure 5: A CUDA Graph-based multi-path communication
Figure 2: Both systems ran UCX version 1.14.0. For MPI support, we used Open MPI version 5.0.4 for both the non-CUDA Graph and CUDA Graph-based evaluations. In all tests, unless otherwise noted, we used pinned host memory and CUDA IPC for all device memory allocations and transfers. We also performed these experiments 1000 times and report the average results. All performance comparisons in this section use the trad…
Figure 7: Multi-Path OMB Unidirectional MPI Bandwidth
Figure 8: Multi-Path OMB Unidirectional MPI Bandwidth
Figure 11: Jacobi communication pattern (a) without…
Figure 9: Multi-Path OMB Bidirectional MPI Bandwidth
Figure 10: Multi-Path OMB Bidirectional MPI Bandwidth
Figure 12: Jacobi runtime speedup over default UCX (UCT::CUDA-IPC) using four MPI ranks on Beluga and Narval clusters. We varied the problem size by fixing the vertical dimension to 8 and increasing the horizontal dimension from 2^23 to 2^30. This means that for the total application data size of 8 GB on four GPUs, each rank exchanges 256 MB of boundary data with its two neighbors in each iteration. We ran the solver f…
Figure 13: Measurement of various CUDA Graph operations during the first iteration of OMB Latency benchmark on Narval
Figure 14: Measurement of various CUDA Graph operations during OMB Latency benchmark on Narval for dual-path…
Original abstract

Effective intra-node GPU communication is essential for optimizing performance in MPI-based HPC applications, especially when leveraging multiple communication paths. In this study, we propose a novel approach that integrates CUDA Graphs into the UCX framework to enhance intra-node multi-path point-to-point GPU communication. By concurrently leveraging multiple paths, including NVLink and PCIe through the host, and optimizing communication workflows using CUDA Graph, we achieve significant reductions in communication overhead and improve execution efficiency. To the best of our knowledge, our proposed approach is the first to seamlessly integrate CUDA Graphs into UCX. Through extensive experiments on a four-GPU node, our proposed CUDA Graph-based multi-path communication approach achieves up to a 2.95x bandwidth improvement, compared to the single-path UCX (UCT::CUDA-IPC), in GPU-to-GPU OMB bandwidth test when utilizing the host path and two other GPU paths, at message sizes up to 512MB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes integrating CUDA Graphs into the UCX framework to support concurrent multi-path intra-node GPU-to-GPU point-to-point communication (including NVLink and host PCIe paths). It claims this is the first such seamless integration and reports up to 2.95x bandwidth improvement over single-path UCX (UCT::CUDA-IPC) in OMB GPU-to-GPU bandwidth tests on a four-GPU node for message sizes up to 512 MB.

Significance. If the experimental results hold under scrutiny, the work could meaningfully improve communication efficiency in MPI-based HPC applications on multi-GPU nodes by reducing launch overhead via CUDA Graphs while exploiting multiple hardware paths. The engineering novelty of embedding CUDA Graphs within UCX is a clear strength.

major comments (1)
  1. [Experimental results / OMB bandwidth tests] The central performance claim (2.95x bandwidth improvement) is presented without details on the number of runs, error bars, exact path configurations (which two GPU paths plus host), driver versions, or controls for cache effects. This information is load-bearing for assessing whether the observed gain is robust or reproducible.
minor comments (1)
  1. [Abstract] The abstract states the approach 'optimizes communication workflows using CUDA Graph' but does not clarify how graph capture is performed around UCX calls or whether any modifications to UCX internals were required.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and will revise the manuscript to improve experimental reproducibility.

Point-by-point responses
  1. Referee: [Experimental results / OMB bandwidth tests] The central performance claim (2.95x bandwidth improvement) is presented without details on the number of runs, error bars, exact path configurations (which two GPU paths plus host), driver versions, or controls for cache effects. This information is load-bearing for assessing whether the observed gain is robust or reproducible.

    Authors: We agree that the manuscript omits these details, which are necessary for full reproducibility and scrutiny of the results. In the revised version we will expand the experimental setup section to specify the number of runs performed for each measurement, include error bars on all reported bandwidth values, provide precise descriptions of the communication paths (including which GPU pairs use NVLink versus the host PCIe path), state the CUDA driver and software versions used, and describe the controls applied to mitigate cache effects. These additions will be placed in Section 4 and will allow readers to better evaluate the robustness of the reported 2.95x improvement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central contribution is an engineering implementation that integrates CUDA Graphs into the UCX framework for multi-path intra-node GPU communication, with performance evaluated through direct benchmarking against an external baseline (single-path UCT::CUDA-IPC). No mathematical derivation, parameter fitting presented as prediction, or self-referential equations appear in the abstract or described claims. The reported 2.95x bandwidth improvement is an observed experimental outcome on a four-GPU node rather than a result forced by definition or prior self-citation, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems and performance-engineering paper. No mathematical derivations, fitted parameters, or new postulated entities are introduced; the work rests on standard assumptions about CUDA and UCX behavior.

pith-pipeline@v0.9.0 · 5480 in / 1197 out tokens · 35434 ms · 2026-05-08T09:57:34.492390+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 21 canonical work pages

  1. [1] CUDA. 2025. https://docs.nvidia.com/cuda/index.html [Accessed: 2025-04-01]
  2. [2] CUDA Graphs. 2025. https://developer.nvidia.com/blog/cuda-graphs/ [Accessed: 2025-04-01]
  3. [3] CUDA Graphs in Dynamic Environments. 2025. https://developer.nvidia.com/blog/employing-cuda-graphs-in-a-dynamic-environment/ [Accessed: 2025-04-01]
  4. [4] InfiniBand Trade Association. 2025. https://www.infinibandta.org/ [Accessed: 2025-04-01]
  5. [5] Multi-GPU Jacobi Solver. 2025. https://github.com/NVIDIA/multi-gpu-programming-models [Accessed: 2025-04-01]
  6. [6] David E. Bernholdt, Swen Boehm, George Bosilca, Manjunath Gorentla Venkata, Ryan E. Grant, Thomas Naughton, Howard P. Pritchard, Martin Schulz, and Geoffroy R. Vallee. 2020. A survey of MPI usage in the US exascale computing project. Concurrency and Computation: Practice and Experience (CCPE) 3 (2020), 1–16. doi:10.1002/cpe.4851
  7. [7] Devendar Bureddy, H. Wang, A. Venkatesh, S. Potluri, and D. K. Panda. 2012. OMB-GPU: A micro-benchmark suite for evaluating MPI libraries on GPU clusters. In Proceedings of the European MPI Users' Group Meeting (EuroMPI). 110–120. doi:10.1007/978-3-642-33518-1_16
  8. [8] Chen Chun Chen, Kawthar Shafie Khorassani, Pouya Kousha, Qinghua Zhou, Jinghan Yao, Hari Subramoni, and Dhabaleswar K. Panda. 2023. MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators. In Proceedings of the SC Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysi…
  9. [9] Yuxin Chen, Benjamin Brock, Katherine Yelick, and John D. Owens. 2022. Scalable Irregular Parallelism with GPUs: Getting CPUs Out of the Way. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1–16. doi:10.1109/SC41404.2022.00055
  10. [10] Jaemin Choi, David F. Richards, and Laxmikant V. Kale. 2022. Improving Scalability with GPU-Aware Asynchronous Tasks. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1–10. arXiv:2202.11819 doi:10.1109/IPDPSW55747.2022.00097
  11. [11] Jonah Ekelund, Stefano Markidis, and Ivy Peng. 2025. Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs. arXiv (2025), 1–8. arXiv:2501.09398 http://arxiv.org/abs/2501.09398
  12. [12] Tsung Wei Huang, Dian Lun Lin, Chun Xun Lin, and Yibo Lin. 2021. Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System. IEEE Transactions on Parallel and Distributed Systems (2021), 1303–1320. doi:10.1109/TPDS.2021.3104255
  13. [13] Dian Lun Lin and Tsung Wei Huang. 2021. Efficient GPU Computation Using Task Graph Parallelism. In Proceedings of the European Conference on Parallel Processing (Euro-Par). Springer International Publishing, 435–450. doi:10.1007/978-3-030-85665-6_27
  14. [14] MPI Forum. 2025. https://www.mpi-forum.org/ [Accessed: 2025-04-01]
  15. [15] MPICH. 2025. https://www.mpich.org/ [Accessed: 2025-04-01]
  16. [16] Akira Nukada. 2022. Performance Optimization of Allreduce Operation for Multi-GPU Systems. In Proceedings of the International Conference on Big Data (Big Data). IEEE, 1–6. doi:10.1109/bigdata52589.2021.9672073
  17. [17] NVIDIA. 2025. https://www.nvidia.com/ [Accessed: 2025-04-01]
  18. [18] NVIDIA. 2025. NVIDIA Collective Communications Library. https://github.com/NVIDIA/nccl [Accessed: 2025-04-01]
  19. [19] Open MPI. 2025. https://www.open-mpi.org/ [Accessed: 2025-04-01]
  20. [20] Bo Qiao, M. Akif Ozkan, Jurgen Teich, and Frank Hannig. 2020. The best of both worlds: Combining CUDA graph with an image processing DSL. In Proceedings of the ACM/IEEE Design Automation Conference (DAC). 1–6. doi:10.1109/DAC18072.2020.9218531
  21. [21] Pavel Shamis, Manjunath Gorentla Venkata, M Graham Lopez, Matthew B Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L Graham, Liran Liss, and Yiftah Shahar. 2015. UCX: An Open Source Framework for HPC Network APIs and Beyond. In Proceedings of the IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 40–43. doi:10.1109/…
  22. [22] Amirhossein Sojoodi, Mohammad Akbari, Hamed Sharifian, Ali Farazdaghi, Ryan E. Grant, and Ahmad Afsahi. 2025. Accelerating Intra-Node GPU Communication: A Performance Model for Multi-Path Transfers. In Proceedings of the Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W). Association for Computi…
  23. [23] Amirhossein Sojoodi, Ali Farazdaghi, Hamed Sharifian, Ryan E. Grant, and Ahmad Afsahi. 2025. Collaborative Bandwidth-Efficient Intra-Node Allreduce. In Proceedings of the International Workshop on Accelerators and Hybrid Emerging Systems (AsHES). 1–5. doi:10.1109/IPDPSW66978.2025.00016
  24. [24] Amirhossein Sojoodi, Majid Salimi Beni, and Farshad Khunjush. 2020. IgniteGPU: a GPU-enabled in-memory computing architecture on clusters. Journal of Supercomputing (2020), 1–28. doi:10.1007/s11227-020-03390-z
  25. [25] Amirhossein Sojoodi, Yıltan Hassan Temucin, and Ahmad Afsahi. 2024. Enhancing Intra-Node GPU-to-GPU Performance in MPI + UCX through Multi-Path Communication. In Proceedings of the International Workshop on Extreme Heterogeneity Solutions (ExHET). 1–6. doi:10.1145/3642961.3643800
  26. [26] Yuya Tatsugi and Akira Nukada. 2022. Accelerating data transfer between host and device using idle GPU. In Proceedings of the Workshop on General Purpose Processing using GPUs (GPGPU). 1–6. doi:10.1145/3530390.3532732
  27. [27] Yıltan Hassan Temucin, Amirhossein Sojoodi, Pedram Alizadeh, and Ahmad Afsahi. 2021. Efficient Multi-Path NVLink/PCIe-Aware UCX based Collective Communication for Deep Learning. In Proceedings of the IEEE Symposium on High-Performance Interconnects (HOTI). 1–10. doi:10.1109/HOTI52880.2021.00018
  28. [28] Yıltan Hassan Temucin, Amirhossein Sojoodi, Pedram Alizadeh, Benjamin W. Kitor, and Ahmad Afsahi. 2021. Accelerating Deep Learning using Interconnect-Aware UCX Communication for MPI Collectives. IEEE Micro (2021), 1–9. doi:10.1109/MM.2022.3148670
  29. [29] Top500. 2025. https://top500.org/ [Accessed: 2025-04-01]
  30. [30] Unified Communication Framework Consortium. 2025. Unified Collective Communication (UCC). https://github.com/openucx/ucc [Accessed: 2025-04-01]
  31. [31] Unified Communication Framework Consortium. 2025. Unified Communication X (UCX). https://openucx.org/ [Accessed: 2025-04-01]
  32. [32, 33] Manjunath Gorentla Venkata, Valentine Petrov, Sergey Lebedev, Devendar Bureddy, Ferrol Aderholdt, Joshua Ladd, Gil Bloch, Mike Dubman, and Gilad Shainer. 2024. Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives. In Proceedings of the IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 37–46. doi:10.1109/HOTI63208.2024.00018
  34. [34] Yuxuan Zhao, Qi Sun, Zhuolun He, Yang Bai, and Bei Yu. 2023. AutoGraph: Optimizing DNN Computation Graph for Parallel GPU Kernel Execution. Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023), 1–9. doi:10.1609/aaai.v37i9.26343
  35. [35] Bojian Zheng, Cody Hao Yu, Jie Wang, Yaoyao Ding, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO). 1364–1380. doi:10.1145/3613424.3614248