pith · machine review for the scientific record

arxiv: 2605.04478 · v1 · submitted 2026-05-06 · 💻 cs.DC · cs.AI

Recognition: unknown

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Dingwen Tao, Fakang Wang, Feng Yu, Guangming Tan, Hairui Zhao, Haoxu Li, Jianhao Fu, Jinwu Yang, Qianyu Zhang, Qian Zhao, Tao Wang, Wenjing Huang, Xingchen Liu, Yang Tian, Yida Gu, Yifan Chen, Yueyuan Zhou, Zedong Liu, Zhan Wang, Zhenhang Sun

Pith reviewed 2026-05-08 17:36 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords: anomaly detection · collective communication · slow/hang anomalies · distributed training · GPU clusters · root cause analysis · large-scale systems · diagnostic tools

The pith

CCL-D detects and locates slow or hang anomalies in large-scale model training by combining a lightweight real-time probe with an intelligent analyzer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large-scale training of AI models on thousands of GPUs frequently encounters slow or hung collective communication that stalls progress for long periods. Traditional diagnostic methods often require hours or days to identify the root cause amid complex hardware and software interactions. CCL-D addresses this by deploying a rank-level probe to measure cross-layer metrics through lightweight tracing and pairing it with an automated analyzer that detects anomalies and pinpoints the faulty GPU rank. If the system performs as claimed, training clusters would recover from these disruptions far more quickly, cutting downtime in massive distributed environments. The authors report that a year-long deployment on a 4,000-GPU cluster achieved near-complete coverage of known anomalies and located issues within 6 minutes.
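The core detection idea — compare a per-rank communication metric across all ranks and flag the outlier — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the metric name (`send_rates`), the median-based comparison, and the threshold are all assumptions.

```python
# Illustrative sketch only: the paper's probe/analyzer internals are not
# reproduced here, so the metric and threshold choices are assumptions.
from statistics import median

def locate_slow_rank(send_rates, rel_threshold=0.5):
    """Flag the rank whose send rate falls furthest below the cluster
    median, mimicking the kind of cross-rank comparison an analyzer
    could use to separate a slow rank from normal variation.

    send_rates: dict mapping rank id -> observed send rate (bytes/s).
    rel_threshold: fraction of the median below which a rank is anomalous.
    Returns the suspect rank id, or None if all ranks look healthy.
    """
    med = median(send_rates.values())
    suspects = {r: v for r, v in send_rates.items() if v < rel_threshold * med}
    if not suspects:
        return None
    # The worst underperformer is the candidate root-cause rank.
    return min(suspects, key=suspects.get)

# Example: rank 3 sends at a fraction of its peers' rate.
rates = {0: 10.1e9, 1: 9.8e9, 2: 10.0e9, 3: 1.2e9}
print(locate_slow_rank(rates))  # -> 3
```

A real analyzer would have to be far more careful — the paper notes that anomalies propagate across ranks, so the rank with the worst metric is not always the root cause — but the sketch captures the cross-rank comparison the probe's metrics enable.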

Core claim

CCL-D pairs a rank-level real-time probe with an intelligent decision analyzer: the probe measures cross-layer anomaly metrics through a lightweight distributed tracing framework that monitors communication traffic, while the analyzer performs automated anomaly detection and root-cause location to identify the faulty GPU rank. Deployed on a 4,000-GPU cluster for one year, the system achieved near-complete coverage of known slow and hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.

What carries the argument

The rank-level real-time probe paired with the intelligent decision analyzer, which together track cross-layer metrics in real time and automate detection plus localization of slow or hang issues.

Load-bearing premise

The cross-layer metrics collected by the lightweight probe must suffice for the analyzer to distinguish every slow and hang root cause from normal variation, without missing novel anomalies or generating excessive false positives.

What would settle it

Running CCL-D on a comparable large cluster during a documented slow or hang anomaly and finding that it either misses the event, mislocates the faulty rank, or takes longer than six minutes to report the issue would challenge the central performance claims.

Figures

Figures reproduced from arXiv:2605.04478 (authors listed above).

Figure 1: Training interruptions and slow/hang root causes. view at source ↗
Figure 2: Position of CCL in training and its hierarchical structure. view at source ↗
Figure 5: Metrics of CCL-D and corresponding anomaly types. view at source ↗
Figure 6: Comparison of SendRate between normal and slow ranks. view at source ↗
Figure 8: The structure of Trace ID. view at source ↗
Figure 7: Decision tree of root-cause ranks; H1–H3 and S1–S3 correspond to the hang/slow types discussed in Section 2.2. view at source ↗
Figure 9: The structure of Probing Frame. view at source ↗
Figure 11: Communication traffic identification overhead and CPU usage per node for anomaly diagnosis at different GPU scales. view at source ↗
Figure 12: Comparison of normalized communication time (vs. Original) for different operations on 16–128 GPUs. view at source ↗
Figure 13: Per-step time and loss over time under different large models. view at source ↗
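The discussion around Figure 8 describes a ratio P for attributing an anomaly: P = 0 corresponds to purely slow communication, P = 1 to purely slow computation, and boundary parameters α and β around 0.5 (e.g. α = 0.4, β = 0.6) mark which contribution dominates, since an exactly equal split rarely occurs at scale. A minimal sketch of such a rule, with illustrative function and label names:

```python
# Sketch of the ratio-based classification described around Figure 8.
# The ratio P and the alpha/beta boundaries come from the paper's text;
# the function name and return labels are illustrative assumptions.

def classify_anomaly(p, alpha=0.4, beta=0.6):
    """Classify an anomaly by the computation share P in [0, 1].

    P -> 0 indicates purely slow communication, P -> 1 purely slow
    computation; values between alpha and beta are treated as mixed.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie in [0, 1]")
    if p > beta:
        return "slow computation"
    if p < alpha:
        return "slow communication"
    return "mixed"

print(classify_anomaly(0.9))  # computation-dominated case
print(classify_anomaly(0.1))  # communication-dominated case
```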
read the original abstract

As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
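The probe's frame layout, described alongside Figures 9 and 12, consists of a shared header plus a body cyclically partitioned into blocks, with the block used by the current operation computed as a counter modulo the block count; the evaluation setup quotes a 32-byte header and a 1152-byte body divided into 8 sub-blocks. The indexing arithmetic can be sketched as follows — the constant and function names are illustrative, and the byte-offset helper is an assumption about how such a layout could be addressed, not the paper's exact scheme:

```python
# Sketch of the cyclic probing-frame indexing described with Figure 9.
# Sizes follow the evaluation setup quoted with Figure 12; names and the
# offset arithmetic are illustrative assumptions.

HEADER_BYTES = 32
BODY_BYTES = 1152
NUM_BLOCKS = 8
BLOCK_BYTES = BODY_BYTES // NUM_BLOCKS  # 144 bytes per sub-block

def block_index(op_counter: int) -> int:
    """Block position for the current operation (cyclic reuse)."""
    return op_counter % NUM_BLOCKS

def block_byte_range(op_counter: int):
    """Byte range of the sub-block a given operation writes into,
    offset past the shared header."""
    start = HEADER_BYTES + block_index(op_counter) * BLOCK_BYTES
    return start, start + BLOCK_BYTES

print(block_index(11))        # -> 3 (11 mod 8)
print(block_byte_range(0))    # -> (32, 176)
```

The cyclic reuse means the frame stays a fixed size regardless of how many collective operations run, which is consistent with the paper's emphasis on a lightweight, bounded-overhead probe.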

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents CCL-D, a diagnostic system for slow/hang anomalies in collective communication libraries during large-scale model training. It combines a rank-level real-time probe that collects cross-layer anomaly metrics via a lightweight distributed tracing framework with an intelligent decision analyzer that performs automated detection and identifies the faulty GPU rank. The system was deployed on a 4,000-GPU cluster for one year and is reported to have achieved near-complete coverage of known anomalies while localizing affected ranks within 6 minutes, substantially outperforming existing solutions.

Significance. If the performance claims hold under rigorous evaluation, CCL-D would address a critical practical bottleneck in large-scale distributed training by reducing anomaly diagnosis time from hours or days to minutes. The year-long deployment on a production-scale cluster constitutes real-world evidence of utility in the cs.DC domain, though the absence of supporting metrics limits the ability to gauge its broader impact on training reliability and efficiency.

major comments (1)
  1. [Abstract] The central claims of 'near-complete coverage of known slow/hang anomalies' and localization 'within 6 minutes' are asserted without any reported quantitative accuracy metrics, false-positive rates, baseline comparisons to existing diagnostic tools, or details of the decision logic inside the intelligent analyzer. These omissions are load-bearing: the soundness of the 'high-precision' and 'substantially outperforming' assertions rests entirely on the unverified deployment outcomes.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed the major comment point by point below, making revisions where feasible while being transparent about the constraints of our production deployment evaluation.

read point-by-point responses
  1. Referee: [Abstract] The central claims of 'near-complete coverage of known slow/hang anomalies' and localization 'within 6 minutes' are asserted without any reported quantitative accuracy metrics, false-positive rates, baseline comparisons to existing diagnostic tools, or details of the decision logic inside the intelligent analyzer. These omissions are load-bearing: the soundness of the 'high-precision' and 'substantially outperforming' assertions rests entirely on the unverified deployment outcomes.

    Authors: We agree that the original abstract presented the deployment outcomes in summary form without sufficient quantitative backing or methodological detail, which weakens the verifiability of the claims. The figures derive from post-hoc analysis of all slow/hang incidents logged over the year-long run on the 4,000-GPU cluster, where CCL-D identified every anomaly that was later confirmed by operators. In the revision we will expand the abstract to include concrete supporting numbers (e.g., total incidents processed, mean and 95th-percentile localization latency, and a one-sentence outline of the analyzer's hybrid rule-plus-model decision procedure). We have also inserted a short limitations paragraph noting that, because the system ran in live production without parallel execution of alternative tools or exhaustive ground-truth labeling, formal false-positive rates and head-to-head baselines are not available from this deployment. These points are now explicitly stated rather than left implicit. revision: partial

standing simulated objections not resolved
  • We cannot supply baseline comparisons or false-positive rates because the evaluation occurred in an uninterrupted production environment; running competing diagnostic systems in parallel or obtaining independent labels for every event would have required halting training jobs, which was not feasible.

Circularity Check

0 steps flagged

No significant circularity; engineering system with empirical deployment support

full rationale

The paper presents CCL-D as an engineering diagnostic system combining a lightweight probe for cross-layer metrics with an intelligent analyzer for anomaly detection and root-cause localization. Its central claims rest on a year-long deployment across a 4,000-GPU cluster that reports near-complete coverage of known slow/hang cases and 6-minute rank localization. No mathematical derivations, equations, parameter fittings, predictions from first principles, or self-citation chains appear in the provided text. The evaluation is purely empirical and externally falsifiable via the reported deployment outcomes, with no reduction of results to inputs by construction. This is the expected non-finding for a systems paper without theoretical modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

As an applied systems paper, the work relies on standard assumptions from distributed computing and monitoring rather than new mathematical axioms or fitted parameters.

axioms (2)
  • domain assumption Lightweight distributed tracing can capture cross-layer communication metrics with negligible overhead in large clusters.
    Invoked to justify the real-time probe design.
  • domain assumption Cross-layer anomaly metrics are sufficient for automated detection and precise root-cause localization of slow/hang events.
    Basis for the intelligent decision analyzer.
invented entities (1)
  • CCL-D diagnostic system · no independent evidence
    purpose: High-precision detection and localization of slow/hang anomalies
    The proposed integrated probe-plus-analyzer framework itself.

pith-pipeline@v0.9.0 · 5536 in / 1243 out tokens · 62301 ms · 2026-05-08T17:36:18.992553+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Palwisha Akhtar, Erhan Tezcan, Fareed Mohammad Qararyah, and Didem Unat. 2020. ComScribe: identifying intra-node GPU communication. In International Symposium on Benchmarking, Measuring and Optimization. Springer, 157–174

  2. [2]

    AMD. 2025. RCCL: ROCm Communication Collectives Library. https://github.com/ROCm/rccl. Accessed August 25, 2025

  3. [3]

    BigScience. 2025. BLOOM 176B Training Log. https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md. Accessed August 25, 2025

  4. [4]

    Sanghun Cho, Hyojun Son, and John Kim. 2023. Logical/physical topology-aware collective communication in deep learning training. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 56–68

  5. [5]

    Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. NVIDIA A100 tensor core GPU: Performance and innovation. IEEE Micro 41, 2 (2021), 29–35

  6. [6]

    James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. 2013. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 1–22

  7. [7]

    Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. 2024. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 958–979

  8. [8]

    Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Quan Chen, Bingsheng He, and Minyi Guo. 2025. XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. arXiv preprint arXiv:2502.05413 (2025)

  9. [9]

    Huangliang Dai, Shixun Wu, Jiajun Huang, Zizhe Jian, Yue Zhu, Haiyang Hu, and Zizhong Chen. 2025. FT-Transformer: Resilient and reliable transformer with end-to-end fault tolerant attention. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1085–1098

  10–11. [10–11]

    Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, et al. 2025. Minder: Faulty machine detection for large-scale distributed model training. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 505–521

  12. [12]

    Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, et al. 2025. Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1246–1258

  13. [13]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  14. [14]

    Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: A checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 929–943

  15. [15]

    Yanjie Gao, Jiyu Luo, Haoxiang Lin, Hongyu Zhang, Ming Wu, and Mao Yang. 2025. dl2: Detecting Communication Deadlocks in Deep Learning Jobs. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 27–38

  16. [16]

    Jiayi Huang, Pritam Majumder, Sungkeun Kim, Abdullah Muzahid, Ki Hwan Yum, and Eun Jung Kim. 2021. Communication algorithm-architecture co-design for distributed deep learning. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 181–194

  17. [17]

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019)

  18. [18]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 947–960

  19. [19]

    Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 463–479

  20–21. [20–21]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. 2024. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 745–760

  22. [22]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  23–24. [23–24]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Vol. 1. Minneapolis, Minnesota

  25. [25]

    Hongbo Li, Zizhong Chen, and Rajiv Gupta. 2017. ParaStack: Efficient hang detection for MPI programs at large scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12

  26. [26]

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020)

  27–28. [27–28]

    Wenshuo Li, Xinghao Chen, Han Shu, Yehui Tang, and Yunhe Wang. 2024. ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking. arXiv preprint arXiv:2406.11257 (2024)

  29–30. [29–30]

    Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. 2023. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2299–2309

  31. [31]

    Kefei Liu, Zhuo Jiang, Jiao Zhang, Haoran Wei, Xiaolong Zhong, Lizhuang Tan, Tian Pan, and Tao Huang. 2023. Hostping: Diagnosing intra-host network bottlenecks in RDMA servers. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 15–29

  32. [32]

    Wei Liu, Kun Qian, Zhenhua Li, Tianyin Xu, Yunhao Liu, Weicheng Wang, Yun Zhang, Jiakang Li, Shuhong Zhu, Xue Li, et al. 2025. SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training. In Proceedings of the ACM SIGCOMM 2025 Conference. 527–540

  33. [33]

    Keith Marzullo and Susan Owicki. 1983. Maintaining the time in a distributed system. In Proceedings of the second annual ACM symposium on Principles of distributed computing. 295–305

  34. [34]

    Avinash Maurya, Robert Underwood, M Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. DataStates-LLM: Lazy asynchronous checkpointing for large language models. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing. 227–239

  35. [35]

    Meta. 2025. Dynolog: a telemetry daemon for performance monitoring and tracing. https://github.com/facebookincubator/dynolog. Accessed August 25, 2025

  36. [36]

    Meta. 2025. OPT 175B Training Log. https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf. Accessed August 25, 2025

  37. [37]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15

  38. [38]

    Thanh-Dat Nguyen, Haoye Tian, Bach Le, Patanamon Thongtanunam, and Shane McIntosh. 2025. A Systematic Survey on Debugging Techniques for Machine Learning Systems. arXiv preprint arXiv:2503.03158 (2025)

  39. [39]

    NVIDIA. 2025. Collective Communication Protocol. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html. Accessed August 25, 2025

  40. [40]

    NVIDIA. 2025. NCCL RAS. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/ras.html. Accessed August 25, 2025

  41. [41]

    NVIDIA. 2025. nccl-tests. https://github.com/NVIDIA/nccl-tests. Accessed August 25, 2025

  42. [42]

    NVIDIA. 2025. NVIDIA Nsight Compute. https://docs.nvidia.com/nsight-compute/NsightCompute/index.html. Accessed August 25, 2025

  43. [43]

    Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557 (2024)

  44. [44]

    Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, and Dhabaleswar K Panda. 2013. Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In 2013 42nd International Conference on Parallel Processing. IEEE, 80–89

  45–46. [45–46]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

  47. [47]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  48. [48]

    Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. 2019. NetBouncer: Active device and link failure localization in data center networks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 599–614

  49. [49]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

  50. [50]

    DLRover Team. 2025. DLRover. https://github.com/intelligent-machine-learning/dlrover. Accessed August 25, 2025

  51. [51]

    Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, et al. 2025. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. arXiv preprint arXiv:2503.05139 (2025)

  52. [52]

    Torch Team. 2025. PyTorch Watchdog. https://pytorch.org/docs/stable/torch_nccl_environment_variables.html. Accessed August 25, 2025

  53. [53]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  54. [54]

    Didem Unat. 2022. Monitoring Collective Communication Among GPUs. In Euro-Par 2021: Parallel Processing Workshops: Euro-Par 2021 International Workshops, Lisbon, Portugal, August 30-31, 2021, Revised Selected Papers, Vol. 13098. Springer Nature, 41

  55. [55]

    A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017)

  56. [56]

    Boxiang Wang, Qifan Xu, Zhengda Bian, and Yang You. 2022. Tesseract: Parallelize the tensor parallelism efficiently. In Proceedings of the 51st International Conference on Parallel Processing. 1–11

  57. [57]

    Yuxin Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, and Xiaowen Chu. 2023. Reliable and efficient in-memory fault tolerance of large language model pretraining. arXiv preprint arXiv:2310.12670 (2023)

  58. [58]

    Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, TS Eugene Ng, and Yida Wang. 2023. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles. 364–381

  59–60. [59–60]

    Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, and Liping Zhang. 2025. GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). 731–747

  61. [61]

    Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, et al. 2024. SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 835–850

  62–63. [62–63]

    Yiwen Zhang, Yue Tan, Brent Stephens, and Mosharaf Chowdhury. 2022. Justitia: Software Multi-Tenancy in Hardware Kernel-Bypass Networks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 1307–1326

  64. [64]

    Hairui Zhao, Hongliang Li, Qi Tian, Jie Wu, Meng Zhang, Zhewen Xu, Xiang Li, and Haixiao Xu. 2025. ArrayPipe: Introducing Job-Array Pipeline Parallelism for High Throughput Model Exploration. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications. IEEE, 1–10

  65. [65]

    Hairui Zhao, Qi Tian, Hongliang Li, and Zizhong Chen. 2025. FlexPipe: Maximizing training efficiency for transformer-based models with Variable-Length inputs. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). 143–159

  66. [66]

    Jingyuan Zhao, Wenyi Zhao, Bo Deng, Zhenghong Wang, Feng Zhang, Wenxiang Zheng, Wanke Cao, Jinrui Nan, Yubo Lian, and Andrew F Burke. 2024. Autonomous driving system: A comprehensive survey. Expert Systems with Applications 242 (2024), 122836

  67–68. [67–68]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277 (2023)

Received 2025-08-23; accepted 2025-11-10