pith · machine review for the scientific record

arxiv: 2605.04478 · v1 · submitted 2026-05-06 · 💻 cs.DC · cs.AI

Recognition: unknown

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Dingwen Tao, Fakang Wang, Feng Yu, Guangming Tan, Hairui Zhao, Haoxu Li, Jianhao Fu, Jinwu Yang, Qianyu Zhang, Qian Zhao, Tao Wang, Wenjing Huang, Xingchen Liu, Yang Tian, Yida Gu, Yifan Chen, Yueyuan Zhou, Zedong Liu, Zhan Wang, Zhenhang Sun

Pith reviewed 2026-05-08 17:36 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords: anomaly detection · collective communication · slow/hang anomalies · distributed training · GPU clusters · root cause analysis · large-scale systems · diagnostic tools

The pith

CCL-D detects and locates slow or hang anomalies in large-scale model training by combining a lightweight real-time probe with an intelligent analyzer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large-scale training of AI models on thousands of GPUs frequently encounters slow or hung collective communication that stalls progress for long periods. Traditional diagnostic methods often require hours or days to identify the root cause amid complex hardware and software interactions. CCL-D addresses this by deploying a rank-level probe to measure cross-layer metrics through lightweight tracing and pairing it with an automated analyzer that detects anomalies and pinpoints the faulty GPU rank. If the system performs as claimed, training clusters would recover from these disruptions far more quickly, cutting downtime in massive distributed environments. The authors report that a year-long deployment on a 4,000-GPU cluster achieved near-complete coverage of known anomalies and located issues within 6 minutes.
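The core detection idea — compare a per-rank communication metric across all ranks and flag the outlier — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the metric name (`send_rates`), the median-based comparison, and the threshold are all assumptions.

```python
# Illustrative sketch only: the paper's probe/analyzer internals are not
# reproduced here, so the metric and threshold choices are assumptions.
from statistics import median

def locate_slow_rank(send_rates, rel_threshold=0.5):
    """Flag the rank whose send rate falls furthest below the cluster
    median, mimicking the kind of cross-rank comparison an analyzer
    could use to separate a slow rank from normal variation.

    send_rates: dict mapping rank id -> observed send rate (bytes/s).
    rel_threshold: fraction of the median below which a rank is anomalous.
    Returns the suspect rank id, or None if all ranks look healthy.
    """
    med = median(send_rates.values())
    suspects = {r: v for r, v in send_rates.items() if v < rel_threshold * med}
    if not suspects:
        return None
    # The worst underperformer is the candidate root-cause rank.
    return min(suspects, key=suspects.get)

# Example: rank 3 sends at a fraction of its peers' rate.
rates = {0: 10.1e9, 1: 9.8e9, 2: 10.0e9, 3: 1.2e9}
print(locate_slow_rank(rates))  # -> 3
```

A real analyzer would have to be far more careful — the paper notes that anomalies propagate across ranks, so the rank with the worst metric is not always the root cause — but the sketch captures the cross-rank comparison the probe's metrics enable.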

Core claim

CCL-D pairs a rank-level real-time probe with an intelligent decision analyzer: the probe measures cross-layer anomaly metrics through a lightweight distributed tracing framework that monitors communication traffic, while the analyzer performs automated anomaly detection and root-cause location to identify the faulty GPU rank. Deployed on a 4,000-GPU cluster for one year, the system achieved near-complete coverage of known slow and hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.

What carries the argument

The rank-level real-time probe paired with the intelligent decision analyzer, which together track cross-layer metrics in real time and automate detection plus localization of slow or hang issues.

Load-bearing premise

The cross-layer metrics collected by the lightweight probe must suffice for the analyzer to distinguish every slow and hang root cause from normal variation, without missing novel anomalies or generating excessive false positives.

What would settle it

Running CCL-D on a comparable large cluster during a documented slow or hang anomaly and finding that it either misses the event, mislocates the faulty rank, or takes longer than six minutes to report the issue would challenge the central performance claims.

Figures

Figures reproduced from arXiv:2605.04478 (authors listed above).

Figure 1: Training interruptions and slow/hang root causes. view at source ↗
Figure 2: Position of CCL in training and its hierarchical structure. view at source ↗
Figure 5: Metrics of CCL-D and corresponding anomaly types. view at source ↗
Figure 6: Comparison of SendRate between normal and slow ranks. view at source ↗
Figure 8: The structure of Trace ID. view at source ↗
Figure 7: Decision tree of root-cause ranks; H1–H3 and S1–S3 correspond to the hang/slow types discussed in Section 2.2. view at source ↗
Figure 9: The structure of Probing Frame. view at source ↗
Figure 11: Communication traffic identification overhead and CPU usage per node for anomaly diagnosis at different GPU scales. view at source ↗
Figure 12: Comparison of normalized communication time (vs. Original) for different operations on 16–128 GPUs. view at source ↗
Figure 13: Per-step time and loss over time under different large models. view at source ↗
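The discussion around Figure 8 describes a ratio P for attributing an anomaly: P = 0 corresponds to purely slow communication, P = 1 to purely slow computation, and boundary parameters α and β around 0.5 (e.g. α = 0.4, β = 0.6) mark which contribution dominates, since an exactly equal split rarely occurs at scale. A minimal sketch of such a rule, with illustrative function and label names:

```python
# Sketch of the ratio-based classification described around Figure 8.
# The ratio P and the alpha/beta boundaries come from the paper's text;
# the function name and return labels are illustrative assumptions.

def classify_anomaly(p, alpha=0.4, beta=0.6):
    """Classify an anomaly by the computation share P in [0, 1].

    P -> 0 indicates purely slow communication, P -> 1 purely slow
    computation; values between alpha and beta are treated as mixed.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie in [0, 1]")
    if p > beta:
        return "slow computation"
    if p < alpha:
        return "slow communication"
    return "mixed"

print(classify_anomaly(0.9))  # computation-dominated case
print(classify_anomaly(0.1))  # communication-dominated case
```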
read the original abstract

As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
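The probe's frame layout, described alongside Figures 9 and 12, consists of a shared header plus a body cyclically partitioned into blocks, with the block used by the current operation computed as a counter modulo the block count; the evaluation setup quotes a 32-byte header and a 1152-byte body divided into 8 sub-blocks. The indexing arithmetic can be sketched as follows — the constant and function names are illustrative, and the byte-offset helper is an assumption about how such a layout could be addressed, not the paper's exact scheme:

```python
# Sketch of the cyclic probing-frame indexing described with Figure 9.
# Sizes follow the evaluation setup quoted with Figure 12; names and the
# offset arithmetic are illustrative assumptions.

HEADER_BYTES = 32
BODY_BYTES = 1152
NUM_BLOCKS = 8
BLOCK_BYTES = BODY_BYTES // NUM_BLOCKS  # 144 bytes per sub-block

def block_index(op_counter: int) -> int:
    """Block position for the current operation (cyclic reuse)."""
    return op_counter % NUM_BLOCKS

def block_byte_range(op_counter: int):
    """Byte range of the sub-block a given operation writes into,
    offset past the shared header."""
    start = HEADER_BYTES + block_index(op_counter) * BLOCK_BYTES
    return start, start + BLOCK_BYTES

print(block_index(11))        # -> 3 (11 mod 8)
print(block_byte_range(0))    # -> (32, 176)
```

The cyclic reuse means the frame stays a fixed size regardless of how many collective operations run, which is consistent with the paper's emphasis on a lightweight, bounded-overhead probe.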

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents CCL-D, a diagnostic system for slow/hang anomalies in collective communication libraries during large-scale model training. It combines a rank-level real-time probe that collects cross-layer anomaly metrics via a lightweight distributed tracing framework with an intelligent decision analyzer that performs automated detection and identifies the faulty GPU rank. The system was deployed on a 4,000-GPU cluster for one year and is reported to have achieved near-complete coverage of known anomalies while localizing affected ranks within 6 minutes, substantially outperforming existing solutions.

Significance. If the performance claims hold under rigorous evaluation, CCL-D would address a critical practical bottleneck in large-scale distributed training by reducing anomaly diagnosis time from hours or days to minutes. The year-long deployment on a production-scale cluster constitutes real-world evidence of utility in the cs.DC domain, though the absence of supporting metrics limits the ability to gauge its broader impact on training reliability and efficiency.

major comments (1)
  1. [Abstract] The central claims of 'near-complete coverage of known slow/hang anomalies' and localization 'within 6 minutes' are asserted without any reported quantitative accuracy metrics, false-positive rates, baseline comparisons to existing diagnostic tools, or details of the decision logic inside the intelligent analyzer. These omissions are load-bearing: the soundness of the 'high-precision' and 'substantially outperforming' assertions rests entirely on the unverified deployment outcomes.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed the major comment point by point below, making revisions where feasible while being transparent about the constraints of our production deployment evaluation.

read point-by-point responses
  1. Referee: [Abstract] The central claims of 'near-complete coverage of known slow/hang anomalies' and localization 'within 6 minutes' are asserted without any reported quantitative accuracy metrics, false-positive rates, baseline comparisons to existing diagnostic tools, or details of the decision logic inside the intelligent analyzer. These omissions are load-bearing: the soundness of the 'high-precision' and 'substantially outperforming' assertions rests entirely on the unverified deployment outcomes.

    Authors: We agree that the original abstract presented the deployment outcomes in summary form without sufficient quantitative backing or methodological detail, which weakens the verifiability of the claims. The figures derive from post-hoc analysis of all slow/hang incidents logged over the year-long run on the 4,000-GPU cluster, where CCL-D identified every anomaly that was later confirmed by operators. In the revision we will expand the abstract to include concrete supporting numbers (e.g., total incidents processed, mean and 95th-percentile localization latency, and a one-sentence outline of the analyzer's hybrid rule-plus-model decision procedure). We have also inserted a short limitations paragraph noting that, because the system ran in live production without parallel execution of alternative tools or exhaustive ground-truth labeling, formal false-positive rates and head-to-head baselines are not available from this deployment. These points are now explicitly stated rather than left implicit. revision: partial

standing simulated objections not resolved
  • We cannot supply baseline comparisons or false-positive rates because the evaluation occurred in an uninterrupted production environment; running competing diagnostic systems in parallel or obtaining independent labels for every event would have required halting training jobs, which was not feasible.

Circularity Check

0 steps flagged

No significant circularity; engineering system with empirical deployment support

full rationale

The paper presents CCL-D as an engineering diagnostic system combining a lightweight probe for cross-layer metrics with an intelligent analyzer for anomaly detection and root-cause localization. Its central claims rest on a year-long deployment across a 4,000-GPU cluster that reports near-complete coverage of known slow/hang cases and 6-minute rank localization. No mathematical derivations, equations, parameter fittings, predictions from first principles, or self-citation chains appear in the provided text. The evaluation is purely empirical and externally falsifiable via the reported deployment outcomes, with no reduction of results to inputs by construction. This is the expected non-finding for a systems paper without theoretical modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

As an applied systems paper, the work relies on standard assumptions from distributed computing and monitoring rather than new mathematical axioms or fitted parameters.

axioms (2)
  • domain assumption Lightweight distributed tracing can capture cross-layer communication metrics with negligible overhead in large clusters.
    Invoked to justify the real-time probe design.
  • domain assumption Cross-layer anomaly metrics are sufficient for automated detection and precise root-cause localization of slow/hang events.
    Basis for the intelligent decision analyzer.
invented entities (1)
  • CCL-D diagnostic system · no independent evidence
    purpose: High-precision detection and localization of slow/hang anomalies
    The proposed integrated probe-plus-analyzer framework itself.

pith-pipeline@v0.9.0 · 5536 in / 1243 out tokens · 62301 ms · 2026-05-08T17:36:18.992553+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Palwisha Akhtar, Erhan Tezcan, Fareed Mohammad Qararyah, and Didem Unat. 2020. ComScribe: identifying intra-node GPU communication. In International Symposium on Benchmarking, Measuring and Optimization. Springer, 157–174

  2. [2]

    AMD. 2025. RCCL: ROCm Communication Collectives Library. https://github.com/ROCm/rccl. Accessed August 25, 2025

  3. [3]

    BigScience. 2025. BLOOM 176B Training Log. https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md. Accessed August 25, 2025

  4. [4]

    Sanghun Cho, Hyojun Son, and John Kim. 2023. Logical/physical topology-aware collective communication in deep learning training. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 56–68

  5. [5]

    Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. NVIDIA A100 tensor core GPU: Performance and innovation. IEEE Micro 41, 2 (2021), 29–35

  6. [6]

    James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. 2013. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 1–22

  7. [7]

    Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. 2024. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 958–979

  8. [8]

    Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Quan Chen, Bingsheng He, and Minyi Guo. 2025. XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. arXiv preprint arXiv:2502.05413 (2025)

  9. [9]

    Huangliang Dai, Shixun Wu, Jiajun Huang, Zizhe Jian, Yue Zhu, Haiyang Hu, and Zizhong Chen. 2025. FT-Transformer: Resilient and reliable transformer with end-to-end fault tolerant attention. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1085–1098

  10–11. [10–11]

    Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, et al. 2025. Minder: Faulty machine detection for large-scale distributed model training. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 505–521

  12. [12]

    Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, et al. 2025. Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1246–1258

  13. [13]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  14. [14]

    Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: A checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 929–943

  15. [15]

    Yanjie Gao, Jiyu Luo, Haoxiang Lin, Hongyu Zhang, Ming Wu, and Mao Yang. 2025. dl2: Detecting Communication Deadlocks in Deep Learning Jobs. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 27–38

  16. [16]

    Jiayi Huang, Pritam Majumder, Sungkeun Kim, Abdullah Muzahid, Ki Hwan Yum, and Eun Jung Kim. 2021. Communication algorithm-architecture co-design for distributed deep learning. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 181–194

  17. [17]

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019)

  18. [18]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 947–960

  19. [19]

    Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 463–479

  20–21. [20–21]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. 2024. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 745–760

  22. [22]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  23–24. [23–24]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Vol. 1. Minneapolis, Minnesota

  25. [25]

    Hongbo Li, Zizhong Chen, and Rajiv Gupta. 2017. ParaStack: Efficient hang detection for MPI programs at large scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12

  26. [26]

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020)

  27–28. [27–28]

    Wenshuo Li, Xinghao Chen, Han Shu, Yehui Tang, and Yunhe Wang. 2024. ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking. arXiv preprint arXiv:2406.11257 (2024)

  29–30. [29–30]

    Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. 2023. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2299–2309

  31. [31]

    Kefei Liu, Zhuo Jiang, Jiao Zhang, Haoran Wei, Xiaolong Zhong, Lizhuang Tan, Tian Pan, and Tao Huang. 2023. Hostping: Diagnosing intra-host network bottlenecks in RDMA servers. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 15–29

  32. [32]

    Wei Liu, Kun Qian, Zhenhua Li, Tianyin Xu, Yunhao Liu, Weicheng Wang, Yun Zhang, Jiakang Li, Shuhong Zhu, Xue Li, et al. 2025. SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training. In Proceedings of the ACM SIGCOMM 2025 Conference. 527–540

  33. [33]

    Keith Marzullo and Susan Owicki. 1983. Maintaining the time in a distributed system. In Proceedings of the second annual ACM symposium on Principles of distributed computing. 295–305

  34. [34]

    Avinash Maurya, Robert Underwood, M Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. DataStates-LLM: Lazy asynchronous checkpointing for large language models. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing. 227–239

  35. [35]

    Meta. 2025. Dynolog: a telemetry daemon for performance monitoring and tracing. https://github.com/facebookincubator/dynolog. Accessed August 25, 2025

  36. [36]

    Meta. 2025. OPT 175B Training Log. https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf. Accessed August 25, 2025

  37. [37]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15

  38. [38]

    Thanh-Dat Nguyen, Haoye Tian, Bach Le, Patanamon Thongtanunam, and Shane McIntosh. 2025. A Systematic Survey on Debugging Techniques for Machine Learning Systems. arXiv preprint arXiv:2503.03158 (2025)

  39. [39]

    NVIDIA. 2025. Collective Communication Protocol. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html. Accessed August 25, 2025

  40. [40]

    NVIDIA. 2025. NCCL RAS. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/ras.html. Accessed August 25, 2025

  41. [41]

    NVIDIA. 2025. nccl-tests. https://github.com/NVIDIA/nccl-tests. Accessed August 25, 2025

  42. [42]

    NVIDIA. 2025. NVIDIA Nsight Compute. https://docs.nvidia.com/nsight-compute/NsightCompute/index.html. Accessed August 25, 2025

  43. [43]

    Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557 (2024)

  44. [44]

    Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, and Dhabaleswar K Panda. 2013. Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In 2013 42nd International Conference on Parallel Processing. IEEE, 80–89

  45–46. [45–46]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

  47. [47]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  48. [48]

    Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. 2019. NetBouncer: Active device and link failure localization in data center networks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 599–614

  49. [49]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

  50. [50]

    DLRover Team. 2025. DLRover. https://github.com/intelligent-machine-learning/dlrover. Accessed August 25, 2025

  51. [51]

    Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, et al. 2025. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. arXiv preprint arXiv:2503.05139 (2025)

  52. [52]

    Torch Team. 2025. PyTorch Watchdog. https://pytorch.org/docs/stable/torch_nccl_environment_variables.html. Accessed August 25, 2025

  53. [53]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  54. [54]

    Didem Unat. 2022. Monitoring Collective Communication Among GPUs. In Euro-Par 2021: Parallel Processing Workshops: Euro-Par 2021 International Workshops, Lisbon, Portugal, August 30-31, 2021, Revised Selected Papers, Vol. 13098. Springer Nature, 41

  55. [55]

    A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017)

  56. [56]

    Boxiang Wang, Qifan Xu, Zhengda Bian, and Yang You. 2022. Tesseract: Parallelize the tensor parallelism efficiently. In Proceedings of the 51st International Conference on Parallel Processing. 1–11

  57. [57]

    Yuxin Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, and Xiaowen Chu. 2023. Reliable and efficient in-memory fault tolerance of large language model pretraining. arXiv preprint arXiv:2310.12670 (2023)

  58. [58]

    Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, TS Eugene Ng, and Yida Wang. 2023. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles. 364–381

  59–60. [59–60]

    Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, and Liping Zhang. 2025. GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). 731–747

  61. [61]

    Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, et al. 2024. SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 835–850

  62–63. [62–63]

    Yiwen Zhang, Yue Tan, Brent Stephens, and Mosharaf Chowdhury. 2022. Justitia: Software Multi-Tenancy in Hardware Kernel-Bypass Networks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 1307–1326

  64. [64]

    Hairui Zhao, Hongliang Li, Qi Tian, Jie Wu, Meng Zhang, Zhewen Xu, Xiang Li, and Haixiao Xu. 2025. ArrayPipe: Introducing Job-Array Pipeline Parallelism for High Throughput Model Exploration. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications. IEEE, 1–10

  65. [65]

    Hairui Zhao, Qi Tian, Hongliang Li, and Zizhong Chen. 2025. FlexPipe: Maximizing training efficiency for transformer-based models with Variable-Length inputs. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). 143–159

  66. [66]

    Jingyuan Zhao, Wenyi Zhao, Bo Deng, Zhenghong Wang, Feng Zhang, Wenxiang Zheng, Wanke Cao, Jinrui Nan, Yubo Lian, and Andrew F Burke. 2024. Autonomous driving system: A comprehensive survey. Expert Systems with Applications 242 (2024), 122836

  67–68. [67–68]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277 (2023)

Received 2025-08-23; accepted 2025-11-10