arxiv: 2605.31000 · v1 · pith:N3OVRURZnew · submitted 2026-05-29 · 💻 cs.NI · cs.LG

HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

Pith reviewed 2026-06-28 20:14 UTC · model grok-4.3

classification 💻 cs.NI cs.LG
keywords heterogeneous clusters · collective communication · mixed-vendor GPUs · LLM training · P2P transport · AllReduce · border-communicator · hierarchical topology

The pith

HetCCL enables high-bandwidth collective communication across mixed-vendor GPU clusters by combining direct P2P transport with vendor-native reductions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HetCCL as a way to run collective operations such as AllReduce on clusters that mix GPUs from different vendors. Standard libraries either assume all hardware is identical or route data through the host with large overhead. HetCCL moves data directly between heterogeneous devices and introduces a border-communicator that re-uses each vendor library's own reduction step. A hierarchical topology abstraction then splits the collective into intra-cluster and inter-cluster phases that minimize total data movement. Evaluation on four vendor combinations shows large gains in both synthetic benchmarks and full LLM training workloads.
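The three-phase split described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: `native_allreduce_sum` is a hypothetical stand-in for one vendor library's AllReduce, the staging layout in phase 3 (one border rank holding the local partial sum, a second holding the received one, all other ranks contributing zero) is an assumption read off the c2cRed diagram residue under Figure 7, and the input values are chosen to reproduce the 6, 22 and 28 that appear there.

```python
from typing import List, Tuple


def native_allreduce_sum(buffers: List[int]) -> List[int]:
    """Hypothetical stand-in for one vendor library's AllReduce(sum) in one cluster."""
    total = sum(buffers)
    return [total] * len(buffers)


def hierarchical_allreduce(
    cluster_a: List[int], cluster_b: List[int]
) -> Tuple[List[int], List[int]]:
    """Two-cluster AllReduce in which only native collectives perform arithmetic."""
    if len(cluster_a) < 2 or len(cluster_b) < 2:
        raise ValueError("this sketch assumes two border ranks per cluster")

    # Phase 1 (intra-cluster): each vendor library reduces its own ranks.
    partial_a = native_allreduce_sum(cluster_a)[0]
    partial_b = native_allreduce_sum(cluster_b)[0]

    # Phase 2 (inter-cluster): border ranks copy partial sums over P2P.
    # This is a pure copy; no cross-vendor reduction kernel is involved.
    received_by_a, received_by_b = partial_b, partial_a

    # Phase 3 (intra-cluster): border rank 0 contributes the local partial,
    # border rank 1 contributes the received one, other ranks contribute 0.
    # A second native AllReduce then yields the global sum on every rank.
    stage_a = [partial_a, received_by_a] + [0] * (len(cluster_a) - 2)
    stage_b = [partial_b, received_by_b] + [0] * (len(cluster_b) - 2)
    return native_allreduce_sum(stage_a), native_allreduce_sum(stage_b)


# Ranks 0..3 form one cluster, ranks 4..7 the other; rank i holds the value i.
out_a, out_b = hierarchical_allreduce([0, 1, 2, 3], [4, 5, 6, 7])
print(out_a, out_b)
```

The point of the layout is that the framework itself never adds two numbers: every sum comes from a call that a single vendor's library already supports, which is one plausible reading of what "vendor independence" means here.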

Core claim

HetCCL achieves vendor independence for combining collectives by using the intrinsic reduction already present inside each vendor's collective communication library, while efficient heterogeneous P2P transport removes host-device copies and a hierarchical topology abstraction guarantees optimal cross-cluster data transfer volume.

What carries the argument

The border-communicator mechanism. It achieves vendor independence by reusing the reduction already built into each vendor library's combining collectives, which sidesteps cross-vendor compatibility issues.

If this is right

  • HetCCL delivers 17-19x higher bandwidth than Gloo for heterogeneous communications.
  • End-to-end LLM training per-step time improves by up to 16.9 percent.
  • The framework supports four different vendors and four heterogeneous cluster settings.
  • Collective communication is decomposed into cluster-level primitives that keep cross-cluster transfer volume minimal.
  • Control is offloaded to CPUs while data movement stays on the devices.
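To make the "minimal cross-cluster transfer volume" point concrete, the arithmetic below compares bytes crossing the cluster boundary in one direction under two schemes. It is a back-of-the-envelope sketch under stated assumptions: the unreduced baseline (every rank ships its full buffer to the other cluster) is a deliberately naive strawman, not a scheme the paper evaluates, and both function names are hypothetical.

```python
def cross_cluster_bytes_unreduced(ranks_per_cluster: int, message_bytes: int) -> int:
    """Naive baseline: every rank ships its full buffer across the boundary."""
    return ranks_per_cluster * message_bytes


def cross_cluster_bytes_hierarchical(message_bytes: int) -> int:
    """Hierarchical scheme: reduce inside the cluster first, then ship one buffer."""
    return message_bytes


message_bytes = 1 << 30  # a 1 GiB gradient buffer, chosen only for illustration
for ranks in (8, 16, 32):
    naive = cross_cluster_bytes_unreduced(ranks, message_bytes)
    hier = cross_cluster_bytes_hierarchical(message_bytes)
    print(f"{ranks} ranks per cluster: naive/hierarchical = {naive // hier}x")
```

Under this model the saving grows linearly with cluster size, because the reduced buffer that crosses the boundary does not grow with the number of ranks behind it.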

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same P2P-plus-border pattern could be applied to non-combining collectives such as Broadcast or AllGather.
  • Clusters could be assembled from lower-cost hardware mixes that were previously ruled out by communication limits.
  • The hierarchical topology abstraction may help schedulers decide when to place model layers across vendor boundaries.

Load-bearing premise

The border-communicator mechanism achieves vendor independence by reusing the reduction built into each vendor library's combining collectives, and it does so without introducing compatibility issues or measurable overhead on any supported vendor.

What would settle it

A measurement on a previously untested vendor pair that shows either a compatibility failure or added latency when the border-communicator is active would falsify the claim of portable reduction without overhead.
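Such a measurement could be structured along the following lines. This is a hedged sketch of a generic timing harness, not anything described in the paper: `median_latency` and `relative_overhead` are hypothetical helpers, and the two `time.sleep` callables are placeholders for a direct vendor-library call and the same call routed through the border-communicator.

```python
import statistics
import time
from typing import Callable


def median_latency(op: Callable[[], None], repeats: int = 30) -> float:
    """Median wall-clock latency of `op` in seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_overhead(
    native_op: Callable[[], None],
    delegated_op: Callable[[], None],
    repeats: int = 30,
) -> float:
    """Fractional extra latency of the delegated path over the native path."""
    native = median_latency(native_op, repeats)
    delegated = median_latency(delegated_op, repeats)
    return (delegated - native) / native


# Placeholder workloads (assumptions): replace with real collective calls.
def native_call() -> None:
    time.sleep(0.001)


def delegated_call() -> None:
    time.sleep(0.002)


print(f"relative overhead: {relative_overhead(native_call, delegated_call, 10):.2f}")
```

A value that stays near zero across message sizes and vendor pairs would support the portability claim; a consistently positive value on any pair would count against it.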

Figures

Figures reproduced from arXiv: 2605.31000 by Guyue Liu, He Liu, Mingjun Zhang, Tao Chang, Yanmin Jia, Yan Zhang, Yonghua Lin, Yongzhe He, Yuanyuan Zhao, Yuejie Wang, Yulong Ao, Zeyu Gu, Zhiyu Li.

Figure 1: Existing and our proposed architecture for clusters with hardware from multiple vendors, dealing with the hardware heterogeneity, differences in the programming models, and varying visibility and control over the underlying hardware. Neither approach provides an efficient solution for heterogeneous LLM training. Device-centric libraries are restricted to single-vendor environments, while host-centric libr…
Figure 2: Inter-node device data transfer mechanism comparison. … downstream applications, 3) minimum software adaptation required from each vendor, 4) high performance in collective communication. Fulfilling these requirements calls for an alternative architecture, as shown in …
Figure 3: Data path overhead of different mechanisms. … requires balancing compatibility and performance between the two existing inter-node data transfer mechanisms. 2.3 Portable Reduction Challenge. Challenge 2: Implement vendor-agnostic data reduction. In §2.2, we focus on the efficiency of data transfers across heterogeneous peers, which is sufficient to construct non-combining collective operations ( …
Figure 5: P2P Transport for Device Data. … and explains how HetCCL addresses the compatibility issue encountered by combining collectives. §4.3 introduces the hierarchical collective algorithm and its pipelined execution. 4.1 Device Buffer P2P Transport. The basic building block of heterogeneous collective communication is the underlying heterogeneous P2P transport. HetCCL enables device buffer data transfer across he…
Figure 6: Heterogeneous Cluster Topology Abstraction. … device memory. Compared with the CPU-forwarding mechanism, host-device memory copies are replaced with device-to-device memory copies (similar to existing device-centric solutions), eliminating the most significant data-path overhead. HetCCL further pipelines the above control logic to overlap the memory-copying and RDMA transfer time and to reuse a pre-allocate…
Figure 7: Inter-cluster data transfer primitive: c2cCpy. (Diagram residue following the caption: a c2cRed panel in which the border ranks of Cluster 0 and Cluster 1 exchange partial sums by Send/Recv, and a Reduce step produces sum(0..3) + sum(4..7) = 6 + 22 = 28 on the ranks.)
Figure 9: Pipelined Collective Algorithm Execution. … operation. Line 13 ∼ 14 wraps up the collective communication with another group of intra-cluster collective operations (endColl), generating the final output value to the internal ranks in the cluster. According to the global communication pattern and the cross-cluster data transfer requirement for each type of collective operation, we summarize the intra-cluster …
Figure 10: HetCCL vendor CCL wrapper performance. (Plot residue following the caption: bandwidth in GB/s versus message size from 128K to 8G; series: nccl nv hom, v1ccl v1 hom, v2ccl v2 hom, v3ccl v3 hom, gloo nv+v3 het, hetccl nv+nv, hetccl nv+v1 het, hetccl nv+v2 het, hetccl nv+v3 het; annotations in source order: lat=0.05ms, bw=17.42GB/s; lat=0.18ms, bw=8.43GB/s; lat=0.17ms, bw=23.88GB/s; lat=0.16ms, bw=43.59GB/s; lat=1.73ms, bw=3.08GB/s; lat=0.1…)
Figure 12: Heterogeneous AllGather Performance. (Plot: bandwidth in GBps versus message size from 128K to 4G; series: nccl, v1ccl, v2ccl, v3ccl, ours nv+v1, ours nv+v2, ours nv+v3, ours v2+v3.)
Figure 16: End-to-end speed up of HetCCL over Gloo. (Plot: per-step time in ms by setup: nv 16, v3 16, het 8+8, het 16+16, het 32+32; parallelism configurations: tp2-dp8-pp1, tp1-dp8-pp2, tp2-dp16-pp1, tp1-dp16-pp2, tp2-dp8-pp2, tp1-dp32-pp2, tp2-dp16-pp2.)
Figure 20: Collective breakdown of AllGather. A Hierarchical Algorithm for Heterogeneous Collectives. In this section, we present the detailed collective breakdown logic of AllGather and AllReduce in …
Figure 21: Collective breakdown of AllReduce.

Abstract

Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks (e.g., NCCL, RCCL) designed for homogeneous environments fail to address mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur heavy overhead in the data path. This paper presents HetCCL, a framework that enables heterogeneous collective communication by efficient P2P transport across heterogeneous devices (e.g., GPUs), eliminating the host-device memory copy overhead while offloading the control to the CPUs. For combining collectives (e.g., AllReduce, ReduceScatter), HetCCL introduces a border-communicator mechanism that achieves vendor independence by using the intrinsic reduction in the combining collectives in vendor collective communication libraries. With efficient heterogeneous P2P transport and portable reduction mechanism, HetCCL proposes a hierarchical topology abstraction for heterogeneous clusters, dissecting collective communication into cluster-level primitives that guarantee optimal cross-cluster data transfer volume and optimal bandwidth utilization. We implement HetCCL with 4 different vendor support and evaluate it in 4 heterogeneous settings with benchmarks and end-to-end LLM tasks. Our evaluation shows that HetCCL achieves 17-19x higher bandwidth than Gloo in heterogeneous communications, and speeds up end-to-end training by up to 16.9% in the per-step-time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents HetCCL, a framework for collective communication in mixed-vendor heterogeneous GPU clusters used for LLM training. It introduces efficient P2P transport across devices to avoid host-device copies, a border-communicator mechanism that delegates reduction to vendor-native combining collectives (AllReduce/ReduceScatter) for vendor independence, and a hierarchical topology abstraction that decomposes communication into cluster-level primitives for optimal data volume and bandwidth. The work claims support for four vendors and reports empirical results from four heterogeneous settings, including 17-19x higher bandwidth than Gloo and up to 16.9% improvement in per-step training time.

Significance. If the performance claims are substantiated, the work would be significant for practical deployment of large-scale training on cost-effective heterogeneous clusters, filling a gap left by homogeneous-focused libraries like NCCL/RCCL and high-overhead heterogeneous ones like Gloo. The empirical focus on end-to-end LLM tasks provides direct evidence of applicability, though the absence of reproducibility artifacts (e.g., code or detailed benchmarks) limits immediate impact.

major comments (2)
  1. [Abstract] Abstract: The headline claims of 17-19x bandwidth improvement over Gloo and 16.9% end-to-end per-step speedup are presented with no description of experimental setup, hardware configurations for the four heterogeneous settings, number of runs, error bars, statistical tests, or data exclusion rules. These numbers are load-bearing for the central empirical contribution and cannot be assessed without this information.
  2. [Design and Evaluation] Border-communicator mechanism (design and evaluation sections): The claim that this mechanism achieves vendor independence with zero measurable overhead by delegating to intrinsic reduction in each vendor's collective library (without compatibility issues or host fallbacks) is central to attributing the reported speedups to the framework rather than implementation artifacts. No per-vendor micro-benchmarks isolating delegation latency or confirming successful hand-off across all four vendors are described, leaving the weakest assumption untested.
minor comments (1)
  1. [Abstract] The abstract and evaluation summary refer to '4 different vendor support' and '4 heterogeneous settings' without naming the vendors or providing a table summarizing the cluster compositions.
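The reporting the referee asks for (number of runs and a spread measure alongside each headline figure) needs very little machinery. The sketch below is illustrative only: the helper names are hypothetical, and the sample values are made-up placeholders, not measurements from the paper.

```python
import statistics
from typing import Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Run count, mean and sample standard deviation for one configuration."""
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # requires at least two runs
    }


def speedup(baseline: List[float], candidate: List[float]) -> float:
    """Ratio of mean bandwidths, candidate over baseline."""
    return statistics.mean(candidate) / statistics.mean(baseline)


# Placeholder bandwidth samples in GB/s (assumptions, not paper data).
baseline_runs = [1.0, 1.1, 0.9, 1.0]
candidate_runs = [18.0, 17.5, 18.5, 18.0]

print("baseline ", summarize(baseline_runs))
print("candidate", summarize(candidate_runs))
print(f"speedup: {speedup(baseline_runs, candidate_runs):.1f}x")
```

Reporting `n` and `stdev` next to each mean lets a reader judge whether a range such as 17-19x reflects run-to-run noise or differences between configurations.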

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of 17-19x bandwidth improvement over Gloo and 16.9% end-to-end per-step speedup are presented with no description of experimental setup, hardware configurations for the four heterogeneous settings, number of runs, error bars, statistical tests, or data exclusion rules. These numbers are load-bearing for the central empirical contribution and cannot be assessed without this information.

    Authors: The abstract provides a concise summary of results; the Evaluation section contains the full experimental details, including hardware configurations for the four heterogeneous settings, number of runs, and performance metrics. To improve accessibility of the headline claims, we will revise the abstract to briefly reference the key experimental configurations and direct readers to the Evaluation section for complete setup, statistical, and reproducibility information.
    Revision: yes

  2. Referee: [Design and Evaluation] Border-communicator mechanism (design and evaluation sections): The claim that this mechanism achieves vendor independence with zero measurable overhead by delegating to intrinsic reduction in each vendor's collective library (without compatibility issues or host fallbacks) is central to attributing the reported speedups to the framework rather than implementation artifacts. No per-vendor micro-benchmarks isolating delegation latency or confirming successful hand-off across all four vendors are described, leaving the weakest assumption untested.

    Authors: The border-communicator delegates reduction to each vendor's native combining collectives to ensure independence and avoid host fallbacks, with the four-vendor support and heterogeneous evaluation results demonstrating successful operation. While overall bandwidth and end-to-end gains provide supporting evidence, we agree that isolated per-vendor micro-benchmarks would more directly confirm zero delegation overhead. We will add these micro-benchmarks to the revised Design and Evaluation sections.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with benchmark-driven claims

full rationale

The paper describes an implementation framework (HetCCL) for heterogeneous collective communication, introducing mechanisms like border-communicator and hierarchical topology abstraction. All central claims (17-19x bandwidth improvement, 16.9% end-to-end speedup) are presented as outcomes of empirical evaluation across four vendors and settings, with no equations, fitted parameters, derivations, or self-citation chains that reduce the reported results to inputs by construction. The weakest assumption (zero-overhead vendor independence via intrinsic reduction) is an engineering claim tested in benchmarks rather than a self-definitional or fitted quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the domain assumption that P2P transport is available and efficient across the four vendors and that each vendor library exposes an intrinsic reduction operation usable by the border-communicator.

axioms (2)
  • domain assumption: P2P transport between heterogeneous devices eliminates host-device memory copy overhead
    Stated as the basis for efficient heterogeneous P2P transport in the abstract.
  • domain assumption: Vendor collective libraries expose intrinsic reduction operations that can be composed portably
    Invoked to justify the border-communicator mechanism for combining collectives.
invented entities (1)
  • border-communicator (no independent evidence)
    purpose: Achieve vendor independence in combining collectives by leveraging intrinsic reductions
    New mechanism introduced to combine collectives across vendors
