arxiv: 2605.11852 · v2 · submitted 2026-05-12 · 💻 cs.NI

Recognition: unknown

Avoiding Cross-Datacenter Collective Congestion via Disaggregated Buffering

Mariano Scazzariello , Noga H. Rotman , Dima Gavrilenko , Sajy Khashab , Alexander Shpiner , Matty Kadosh , Marco Chiesa , Dejan Kostic

show 1 more author

Mark Silberstein

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:52 UTC · model grok-4.3

classification 💻 cs.NI

keywords cross-datacentercollective communicationcongestion controlLLM trainingin-network bufferingpacket loss preventionmulti-datacenter networksSpillway

0 comments

The pith

Spillway buffers dropped packets at the destination data center to prevent congestion from cross-DC collective collisions in large-scale LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Spillway, a mechanism that uses switch-disaggregated buffers in the destination data center to hold packets that would otherwise be dropped when cross-datacenter collectives collide with local traffic. This addresses the slow reaction time of multi-millisecond congestion control loops that lead to packet loss and collapse. Simulations and a hardware prototype demonstrate that it removes the resulting performance degradation and shortens iteration times by as much as 14 percent. The approach requires no modifications to end hosts or training frameworks, making it immediately applicable to existing multi-DC setups.

Core claim

Spillway is a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers in a destination data center and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, it eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.

What carries the argument

Switch-disaggregated buffers that temporarily store packets at the destination until local congestion clears, then forward them without requiring host intervention.

If this is right

Iteration times in multi-DC LLM training decrease by up to 14% during collective operations.
Severe packet loss and congestion collapse from colliding traffic are eliminated.
No modifications to end hosts or training frameworks are needed for deployment.
The mechanism handles the delay in congestion control responses that spans multiple milliseconds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar buffering could help other distributed applications that span multiple data centers and rely on collective patterns.
Hardware prototypes suggest the approach may scale to production environments with appropriate buffer sizing.
By offloading recovery to the network, it could complement rather than replace end-to-end congestion control schemes.

Load-bearing premise

Switch-disaggregated buffers can be added transparently to existing hardware, hold enough packets for the bursts involved, and release them without introducing fresh contention points.

What would settle it

Observing no reduction in iteration time or increased packet loss in a scaled hardware test where buffer capacity is reached or draining creates downstream congestion.

Figures

Figures reproduced from arXiv: 2605.11852 by Alexander Shpiner, Dejan Kostic, Dima Gavrilenko, Marco Chiesa, Mariano Scazzariello, Mark Silberstein, Matty Kadosh, Noga H. Rotman, Sajy Khashab.

**Figure 2.** Figure 2: Different approaches with 16 concurrent remote [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Impact of long-haul loss under cross-DC traffic [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the SPILLWAY architecture. would trigger a multi-millisecond retransmission that directly inflates iteration time, while local AllToAll sits on the microbatch critical path and must not be delayed by deflection (see Sec. 2). This asymmetric treatment is also the source of the core difficulty ( [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Cross-DC FCT slowdown vs. ideal under RTO [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Impact of SPILLWAY on training performance. the destination leaf, triggering PFC and delaying the AllToAll. This delay propagates to the pipeline-parallel send stage and subsequent stages, offsetting the latency advantage. We provide a detailed analysis of this interaction in App. A. At 30 ms, our evaluation reveals a broader insight: local compute finishes before the cross-DC transfer, leaving no overla… view at source ↗

**Figure 9.** Figure 9: Spine buffer utilization under extreme congestion. [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 8.** Figure 8: reports spillway buffer utilization for the two anycast strategies, normalized to the aggregate capacity (i.e., 512 GB). Utilization remains low in all cases. The Stateless variant drains more aggressively, occasionally triggering small incasts that cause limited re-deflections; however, this effect is 0 2 4 6 8 10 12 14 16 Time (ms) 0.00 0.01 0.02 0.03 Utilization (%) Sticky Stateless (a) DC-Anycast. 0 2… view at source ↗

**Figure 12.** Figure 12: FCT of the lossy flow vs. high-priority burst. [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗

**Figure 11.** Figure 11: Impact of fast CNP feedback on flow behavior. [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

**Figure 13.** Figure 13: FCT of the lossy flow vs. high-priority burst with [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

**Figure 14.** Figure 14: AllToAll collective is delayed due to overlap with [PITH_FULL_IMAGE:figures/full_fig_p016_14.png] view at source ↗

read the original abstract

LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DC), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination - a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse. We present Spillway, a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers in a destination data center and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14 %, without changes to end hosts or training frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Spillway gives a workable in-network buffer fix for cross-DC collective collisions in big LLM jobs, with sims and a prototype backing the 14% gain, but buffer scaling at target size is still open.

read the letter

Spillway targets the packet loss that hits when cross-DC collectives slam into intra-DC traffic at the receiving end. The fix is to park the dropped packets in extra switch-disaggregated buffers in the destination DC and release them once the short-term congestion eases, all without touching hosts or the training code. That transparency is the practical part that stands out. The large-scale simulations and hardware prototype both show the iteration-time hit disappearing, with gains up to 14 percent. The numbers come from end-to-end runs rather than just micro-benchmarks, which makes the result easier to take at face value. The mechanism itself is a straightforward extension of existing buffering ideas, tuned to the multi-millisecond round-trip times of long-haul links. The soft spot is capacity and drain behavior at the scale the motivation cites. The prototype is small, and nothing in the reported evidence shows that the required buffer depth stays feasible once GPU counts reach tens of thousands or that draining the stored packets avoids fresh contention on the same links. If the full paper has detailed buffer sizing curves and contention traces at that scale, the concern shrinks; otherwise it stays live. This work is for network engineers and cluster operators who already run multi-DC training and need a drop-in mitigation. It is not a theoretical advance, but the empirical side is solid enough that a serious referee should see it. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Spillway, a transparent in-network mechanism that uses switch-disaggregated buffers in the destination datacenter to store packets dropped during collisions between cross-DC collectives and intra-DC traffic. It claims that this approach eliminates performance degradation in multi-DC LLM training at tens-of-thousands-of-GPUs scale, reducing iteration time by up to 14% as shown in large-scale end-to-end simulations and a hardware prototype, without requiring changes to end hosts or training frameworks.

Significance. If the empirical results hold at the target scale, Spillway would address a practical and previously overlooked congestion collapse mode in cross-datacenter collective communication, offering a deployable mitigation that preserves existing host and framework stacks. The combination of large-scale simulation and hardware prototype is a clear strength, providing direct evidence rather than purely analytical claims. The reported 14% improvement would be meaningful for production training workloads if the buffer-capacity and transparency assumptions are shown to scale.

major comments (2)

[Abstract and prototype section] Abstract and prototype section: the hardware prototype demonstrates the buffering mechanism at small scale, but the central claim that switch-disaggregated buffers remain feasible for the multi-millisecond bursts at tens-of-thousands-of-GPUs scale is not supported by any capacity analysis or scaling argument; the manuscript provides no evidence that required buffer depth fits in existing switch hardware or that the drain path avoids new contention at the cited scale.
[Simulation evaluation section] Simulation evaluation section: the claim of up to 14% iteration-time reduction rests on end-to-end simulations, yet no detailed baselines, traffic-pattern definitions, buffer-size parameters, or error bars are reported; without these the quantitative support for the performance claim cannot be assessed and the result remains load-bearing for the paper's contribution.

minor comments (2)

[Introduction] Clarify in the introduction how 'switch-disaggregated buffers' are realized in commodity hardware without requiring new switch ASICs or host modifications.
[Evaluation] Add a table or figure caption that explicitly lists the simulation parameters (link bandwidths, buffer depths, collective sizes) used to obtain the 14% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of Spillway's significance and for the constructive major comments. We address each point below and will make the necessary revisions to strengthen the manuscript's claims with additional analysis and details.

read point-by-point responses

Referee: [Abstract and prototype section] Abstract and prototype section: the hardware prototype demonstrates the buffering mechanism at small scale, but the central claim that switch-disaggregated buffers remain feasible for the multi-millisecond bursts at tens-of-thousands-of-GPUs scale is not supported by any capacity analysis or scaling argument; the manuscript provides no evidence that required buffer depth fits in existing switch hardware or that the drain path avoids new contention at the cited scale.

Authors: We thank the referee for highlighting this gap. While our large-scale simulations implicitly validate the buffer feasibility at the target scale by achieving the reported performance without buffer overflow, we agree that an explicit analysis is necessary. In the revised manuscript, we will include a new subsection on buffer capacity requirements. This will calculate the maximum burst size based on the cross-DC collective traffic patterns at 10k+ GPUs (e.g., deriving multi-ms burst volumes from the simulation parameters) and compare it against typical disaggregated buffer sizes in modern switches (such as 256-512 MB per port in high-end hardware). Additionally, we will explain that the drain path utilizes dedicated low-priority queues to avoid introducing contention with ongoing traffic. revision: yes
Referee: [Simulation evaluation section] Simulation evaluation section: the claim of up to 14% iteration-time reduction rests on end-to-end simulations, yet no detailed baselines, traffic-pattern definitions, buffer-size parameters, or error bars are reported; without these the quantitative support for the performance claim cannot be assessed and the result remains load-bearing for the paper's contribution.

Authors: The referee correctly identifies that the simulation details are insufficiently documented. We will revise the evaluation section to provide: (1) precise definitions of the traffic patterns, including the specific cross-DC collective operations (e.g., all-reduce on model parameters) and their collision with intra-DC flows; (2) the baseline configurations, such as standard RDMA over TCP without Spillway; (3) the buffer sizes employed in the simulations (e.g., 100 MB per disaggregated buffer); and (4) results with error bars from at least 10 independent runs to show variability. These additions will allow readers to fully assess the 14% improvement claim. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements from simulation and prototype

full rationale

The paper presents Spillway as an in-network buffering mechanism evaluated through large-scale end-to-end simulations and a hardware prototype. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. The central performance claim (up to 14% iteration time reduction) is reported as a measured outcome rather than a prediction derived from the mechanism's own inputs. No self-definitional loops, renamed known results, or load-bearing uniqueness theorems are present. The work is self-contained against external benchmarks via direct experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim depends on the practical feasibility of the proposed buffering mechanism rather than on mathematical axioms or fitted parameters; the only background assumptions are standard networking properties such as the speed of congestion-control loops.

axioms (1)

domain assumption Congestion-control reaction time is on the order of multiple milliseconds and therefore too slow for microsecond-scale packet bursts from collectives
Stated directly in the abstract as the root cause of the observed packet loss.

invented entities (1)

Spillway buffering mechanism no independent evidence
purpose: Capture and later release packets dropped at destination-DC switches during collective collisions
New system component introduced by the paper; no independent evidence outside the reported simulations and prototype is provided.

pith-pipeline@v0.9.0 · 5462 in / 1400 out tokens · 58207 ms · 2026-05-14T20:52:36.305954+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

54 extracted references

[1]

Burst-tolerant datacenter networks with vertigo

Sepehr Abdous, Erfan Sharafzadeh, and Soudeh Ghor- bani. Burst-tolerant datacenter networks with vertigo. InProceedings of the 17th International Conference on emerging Networking EXperiments and Technologies, pages 1–15, 2021

2021
[2]

Practical packet deflection in datacenters.Proceed- ings of the ACM on Networking, 1(CoNEXT3):1–25, 2023

Sepehr Abdous, Erfan Sharafzadeh, and Soudeh Ghor- bani. Practical packet deflection in datacenters.Proceed- ings of the ACM on Networking, 1(CoNEXT3):1–25, 2023

2023
[3]

Claude Mythos 5: What the First 10-Trillion-Parameter Model Actually Means for Developers, 2026

AI Magicx Team. Claude Mythos 5: What the First 10-Trillion-Parameter Model Actually Means for Developers, 2026. https://www.aimagicx.com/ blog/claude-mythos-5-trillion-parameter- model-developer-guide-2026

2026
[4]

Microsoft azure delivers the first large scale cluster with NVIDIA GB300 NVL72 for OpenAI workloads,

Rani Borkar and Nidhi Chappell. Microsoft azure delivers the first large scale cluster with NVIDIA GB300 NVL72 for OpenAI workloads,
[5]

https://azure.microsoft.com/en- us/blog/microsoft-azure-delivers-the-first- large-scale-cluster-with-nvidia-gb300- nvl72-for-openai-workloads/
[6]

Syccl: Exploiting symmetry for efficient collective com- munication scheduling

Jiamin Cao, Shangfeng Shi, Jiaqi Gao, Weisen Liu, Yi- fan Yang, Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian, Ying Liu, Mingwei Xu, Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai. Syccl: Exploiting symmetry for efficient collective com- munication scheduling. SIGCOMM ’25, page 645–662, New York, NY , USA, 2025. Association for...

2025
[7]

Priority Flow Control: Build Reliable Layer 2 Infrastructure , 2015

Cisco. Priority Flow Control: Build Reliable Layer 2 Infrastructure , 2015. https://e2e.ti.com/cfs- file/__key/communityserver-discussions- components-files/908/802.1q-Flow-Control- white_5F00_paper_5F00_c11_2D00_542809.pdf

2015
[8]

xAI’s Colos- sus 2 - First Gigawatt Datacenter In The World, Unique RL Methodology, Capital Raise, 2025

Jeremie Eliahou Ontiveros, Dylan Patel, Wei Zhou, AJ Kourabi, and Maya Barkin. xAI’s Colos- sus 2 - First Gigawatt Datacenter In The World, Unique RL Methodology, Capital Raise, 2025. http://semianalysis.com/p/xais-colossus-2- first-gigawatt-datacenter

2025
[9]

Rdma over ethernet for distributed training at meta scale

Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. Rdma over ethernet for distributed training at meta scale. ACM SIGCOMM ’24, page 57–70, New York, NY , USA, 2024. Associa...

2024
[10]

Flowmoe: A scalable pipeline scheduling framework for distributed mixture-of-experts training, 2025

Yunqi Gao, Bing Hu, Mahdi Boloursaz Mashhadi, A- Long Jin, Yanfeng Zhang, Pei Xiao, Rahim Tafazolli, and Merouane Debbah. Flowmoe: A scalable pipeline scheduling framework for distributed mixture-of-experts training, 2025

2025
[11]

I’ve got 99 problems but flops ain’t one

Alexandru M Gherghescu, Vlad-Andrei B˘adoiu, Alexan- dru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, and Costin Raiciu. I’ve got 99 problems but flops ain’t one. InProceedings of the 23rd ACM Workshop on Hot Topics in Networks, pages 195–204, 2024

2024
[12]

Infinite scale: The architec- ture behind the Azure AI superfactory, 2025

Scott Guthrie. Infinite scale: The architec- ture behind the Azure AI superfactory, 2025. https://blogs.microsoft.com/blog/2025/11/ 12/infinite-scale-the-architecture-behind- the-azure-ai-superfactory/

2025
[13]

Inter-data center rdma: Challenges, status, and future directions.Future Internet, 17(6), 2025

Xiaoying Huang and Jingwei Wang. Inter-data center rdma: Challenges, status, and future directions.Future Internet, 17(6), 2025

2025
[14]

Energy consumption in parallel neural network training, 2025

Philipp Huber, David Li, Juan Pedro Gutiérrez Her- mosillo Muriedas, Deifilia Kieckhefen, Markus Götz, Achim Streit, and Charlotte Debus. Energy consumption in parallel neural network training, 2025

2025
[15]

InfiniBand Trade Association.InfiniBand Architecture Specification, Volume 1, Release 1.5, 2021

2021
[16]

Accelerating distributed MoE training and inference with lina

Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed MoE training and inference with lina. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, Boston, MA, July 2023. USENIX Association

2023
[17]

Hanks, David Meyer, and Paul S

Tony Li, Dino Farinacci, Stanley P. Hanks, David Meyer, and Paul S. Traina. Generic Routing Encapsulation (GRE). RFC 2784, March 2000

2000
[18]

Revisiting rdma reliability for lossy fabrics

Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, et al. Revisiting rdma reliability for lossy fabrics. InProceedings of the ACM SIGCOMM 2025 Conference, pages 85–98, 2025

2025
[19]

Hpcc: high precision congestion control

Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. Hpcc: high precision congestion control. InProceed- ings of the ACM Special Interest Group on Data Com- munication, SIGCOMM ’19, page 44–58, New York, NY , USA, 2019. Association for Computing Machinery

2019
[20]

Handling future congestion in cross-datacenter rdma networks

Peiyuan Lin, Shuo Wang, Dong Zhou, Siyu Han, Chil- iang Zhong, Yupeng Liang, and Tao Huang. Handling future congestion in cross-datacenter rdma networks. In 13 2025 Thirteenth International Conference on Advanced Cloud and Big Data (CBD), pages 7–12, 2025

2025
[21]

Rethinking machine learn- ing collective communication as a multi-commodity flow problem

Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, and Luke Marshall. Rethinking machine learn- ing collective communication as a multi-commodity flow problem. InProceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, page 16–37, New York, NY , USA, 2024. Association for Computing Machinery

2024
[22]

Multi-Path transport for RDMA in datacenters

Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Multi-Path transport for RDMA in datacenters. In15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 357–371, Renton, WA, April 2018. USENIX Associa- tion

2018
[23]

Load balancing for ai training workloads, 2026

Sarah McClure, Evyatar Cohen, Alex Shpiner, Mark Sil- berstein, Sylvia Ratnasamy, Scott Shenker, and Isaac Keslassy. Load balancing for ai training workloads, 2026

2026
[24]

Revisiting network support for rdma

Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. Revisiting network support for rdma. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 313–326, 2018

2018
[25]

Efficient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. InProceedings of the Inter- national Conference for High Perfo...

2021
[26]

Themis: Addressing congestion-induced unfairness in long-haul rdma net- works

Zihan Niu, Menghao Zhang, Jue Zhang, Renjie Xie, Yuan Yang, and Xiaohe Hu. Themis: Addressing congestion-induced unfairness in long-haul rdma net- works. In2025 IEEE 33rd International Conference on Network Protocols (ICNP), pages 1–13. IEEE, 2025

2025
[27]

NVIDIA InfiniBand Adaptive Rout- ing Technology, 2023

NVIDIA. NVIDIA InfiniBand Adaptive Rout- ing Technology, 2023. https://storage. ghost.io/c/35/17/35170502-dfe4-4f36-9612- bdc657f28241/content/files/2023/12/NVIDIA_ InfiniBand_Adaptive_Routing_Technology_ Insights_Whitepaper.pdf

2023
[28]

NVIDIA Spectrum-X White Paper,

NVIDIA. NVIDIA Spectrum-X White Paper,
[29]

https://resources.nvidia.com/en- us-accelerated-networking-resource- library/nvidia-spectrum-x
[30]

Turbocharge LLM Training across Long-Haul Data Center Networks with NVIDIA NeMo Framework,

NVIDIA. Turbocharge LLM Training across Long-Haul Data Center Networks with NVIDIA NeMo Framework,
[31]

https://developer.nvidia.com/blog/ turbocharge-llm-training-across-long-haul- data-center-networks-with-nvidia-nemo- framework/
[32]

NVIDIA DGX B200, 2025

NVIDIA. NVIDIA DGX B200, 2025. https://www. nvidia.com/en-us/data-center/dgx-b200/

2025
[33]

NVIDIA DGX SuperPOD: Reference architecture – network fabrics, 2025

NVIDIA. NVIDIA DGX SuperPOD: Reference architecture – network fabrics, 2025. https: //docs.nvidia.com/dgx-superpod/reference- architecture-scalable-infrastructure- b200/latest/network-fabrics.html

2025
[34]

NVIDIA Spectrum-4 ASIC, 2025

NVIDIA. NVIDIA Spectrum-4 ASIC, 2025. https: //nvdam.widen.net/s/pjlcwnrdbn/ethernet- switches-spectrum-4-asic-datasheet-us

2025
[35]

NVIDIA BlueField-3 Data Processing Unit, 2023

NVIDIA Corporation. NVIDIA BlueField-3 Data Processing Unit, 2023. https://www.nvidia. com/content/dam/en-zz/Solutions/Data- Center/documents/datasheet-nvidia- bluefield-3-dpu.pdf

2023
[36]

Pre-Training GPT-4.5, 2025

OpenAI. Pre-Training GPT-4.5, 2025. https://www. youtube.com/watch?v=6nJZopACRuQ

2025
[37]

AWS Trainium3 Deep Dive - A Potential Challenger Approaching, 2025

Dylan Patel, Daniel Nishball, Wega Chu, My- ron Xie, Ivan Chiam, Clara Ee, Cheang Kang Wen, Wei Zhou, Jeremie Eliahou Ontiveros, and Tanj Bennett. AWS Trainium3 Deep Dive - A Potential Challenger Approaching, 2025. https://newsletter.semianalysis.com/p/aws- trainium3-deep-dive-a-potential

2025
[38]

Multi-Datacenter Training: Ope- nAI’s Ambitious Plan To Beat Google’s Infrastructure,

Dylan Patel, Daniel Nishball, and Jeremie Elia- hou Ontiveros. Multi-Datacenter Training: Ope- nAI’s Ambitious Plan To Beat Google’s Infrastructure,
[39]

https://newsletter.semianalysis.com/p/ multi-datacenter-training-openais
[40]

Building Prometheus: How Backend Aggrega- tion Enables Gigawatt-Scale AI Clusters, 2026

Jalpa Patel, Ankur Singh, and Hany Morsy. Building Prometheus: How Backend Aggrega- tion Enables Gigawatt-Scale AI Clusters, 2026. https://engineering.fb.com/2026/02/09/data- center-engineering/building-prometheus-how- backend-aggregation-enables-gigawatt-scale- ai-clusters/

2026
[41]

Alibaba hpn: A data center network for large language model training

Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, 14 and Dennis Cai. Alibaba hpn: A data center network for large language model training. InProceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24...

2024
[42]

An- nulus: A dual congestion control loop for datacenter and wan traffic aggregates

Ahmed Saeed, Varun Gupta, Prateesh Goyal, Milad Sharif, Rong Pan, Mostafa Ammar, Ellen Zegura, Keon Jang, Mohammad Alizadeh, Abdul Kabbani, et al. An- nulus: A dual congestion control loop for datacenter and wan traffic aggregates. InProceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technolog...

2020
[43]

Ml- synth: Towards synthetic ml traces

Adel Sefiane, Alireza Farshin, and Marios Kogias. Ml- synth: Towards synthetic ml traces. InProceedings of the 2nd Workshop on Networks for AI Computing, NAIC ’25, page 98–104, New York, NY , USA, 2025. Association for Computing Machinery

2025
[44]

TACCL: Guiding collective algorithm synthesis using communi- cation sketches

Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Ja- cob Nelson, Olli Saarikivi, and Rachee Singh. TACCL: Guiding collective algorithm synthesis using communi- cation sketches. In20th USENIX Symposium on Net- worked Systems Design and Implementation (NSDI 23), pages 593–612, Boston, MA, April 2023. USENIX As- sociation

2023
[45]

Collective communication for 100k+ gpus, 2026

Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lume...

2026
[46]

Gemini: A family of highly capable multimodal models, 2025

Gemini Team et al. Gemini: A family of highly capable multimodal models, 2025

2025
[47]

Bicc: Bilateral con- gestion control in cross-datacenter rdma networks

Zirui Wan, Jiao Zhang, Mingxuan Yu, Junwei Liu, Jun Yao, Xinghua Zhao, and Tao Huang. Bicc: Bilateral con- gestion control in cross-datacenter rdma networks. In IEEE INFOCOM 2024-IEEE Conference on Computer Communications, pages 1381–1390. IEEE, 2024

2024
[48]

Switch Packet Buffers, 2019

Jim Warner. Switch Packet Buffers, 2019. https:// people.ucsc.edu/~warner/buffer.html

2019
[49]

Astra-sim2.0: Modeling hierarchical networks and dis- aggregated systems for large-model training at scale

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. Astra-sim2.0: Modeling hierarchical networks and dis- aggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 283– 294, 2023

2023
[50]

Hashing design in modern networks: Challenges and mitigation techniques

Yunhong Xu, Keqiang He, Rui Wang, Minlan Yu, Nick Duffield, Hassan Wassel, Shidong Zhang, Leon Poutievski, Junlan Zhou, and Amin Vahdat. Hashing design in modern networks: Challenges and mitigation techniques. In2022 USENIX Annual Technical Confer- ence (USENIX ATC 22), pages 805–818, Carlsbad, CA, July 2022. USENIX Association

2022
[51]

Dibs: Just-in- time congestion mitigation for data centers

Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz- Bassett, Minlan Yu, and Jitendra Padhye. Dibs: Just-in- time congestion mitigation for data centers. InProceed- ings of the Ninth European Conference on Computer Systems, pages 1–14, 2014

2014
[52]

Fine- grained feedback-driven flow control in cross-datacenter rdma networks

Chiliang Zhong, Shuo Wang, Siyu Han, Zhou Dong, Peiyuan Lin, Yupeng Liang, and Tao Huang. Fine- grained feedback-driven flow control in cross-datacenter rdma networks. In2025 Thirteenth International Con- ference on Advanced Cloud and Big Data (CBD), pages 1–6. IEEE, 2025

2025
[53]

Miti- gating inter-datacenter incast with a proxy: The shortest path is not necessarily the fastest

Anchengcheng Zhou, Carter Costic, Hongyu Hè, Ahmad Ghalayini, Abdul Kabbani, and Maria Apostolaki. Miti- gating inter-datacenter incast with a proxy: The shortest path is not necessarily the fastest. InProceedings of the 24th ACM Workshop on Hot Topics in Networks, pages 344–353, 2025

2025
[54]

Congestion control for large-scale rdma deploy- ments.ACM SIGCOMM Computer Communication Review, 45(4):523–536, 2015

Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Pad- hye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale rdma deploy- ments.ACM SIGCOMM Computer Communication Review, 45(4):523–536, 2015. A Cross-DC Latency Mitigating Intra-DC HAR Contention Hierarchical AllReduce...

2015