Avoiding Cross-Datacenter Collective Congestion via Disaggregated Buffering
Pith reviewed 2026-05-14 20:52 UTC · model grok-4.3
The pith
Spillway buffers packets that would otherwise be dropped at the destination data center, absorbing the congestion caused when cross-DC collectives collide with local traffic in large-scale LLM training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Spillway is a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers at the destination data center and drains them once congestion subsides. In large-scale end-to-end simulations and a hardware prototype, it eliminates the performance degradation caused by collective collisions, reducing iteration time by up to 14% without changes to end hosts or training frameworks.
What carries the argument
Switch-disaggregated buffers that temporarily store packets at the destination until local congestion clears, then forward them without requiring host intervention.
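The page gives no implementation details beyond this one-sentence description, but the store-then-drain behavior can be sketched abstractly. Everything below (class name, capacity check, drain policy) is a hypothetical illustration, not the paper's actual design:

```python
from collections import deque


class SpillwayBufferSketch:
    """Illustrative model of destination-side disaggregated buffering.

    Hypothetical: the text quoted here does not specify capacity
    limits, drain ordering, or overflow policy.
    """

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.spill = deque()  # packets diverted instead of dropped

    def on_egress_drop(self, packet: bytes) -> bool:
        # On egress-queue overflow, divert the packet into the
        # disaggregated buffer instead of dropping it, if space remains.
        if self.used + len(packet) <= self.capacity:
            self.spill.append(packet)
            self.used += len(packet)
            return True   # absorbed transparently
        return False      # buffer full: the packet is lost as before

    def drain(self, egress_has_headroom) -> list:
        # Once local congestion subsides, forward buffered packets in
        # arrival order; end hosts never observe the detour.
        sent = []
        while self.spill and egress_has_headroom():
            pkt = self.spill.popleft()
            self.used -= len(pkt)
            sent.append(pkt)
        return sent
```

In-order draining matters for RDMA transports that react badly to reordering; a real switch implementation would also need to bound how long packets may sit in the buffer.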
If this is right
- Iteration times in multi-DC LLM training decrease by up to 14% during collective operations.
- Severe packet loss and congestion collapse from colliding traffic are eliminated.
- No modifications to end hosts or training frameworks are needed for deployment.
- The mechanism masks the multi-millisecond reaction delay of the end-to-end congestion-control loop.
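To make the last bullet concrete: while the control loop is still reacting, the cross-DC link keeps delivering data that must land somewhere. A rough estimate of that volume (the link rate and loop delay are illustrative assumptions, not figures from the paper):

```python
def inflight_bytes(link_gbps: float, loop_delay_ms: float) -> float:
    """Bytes arriving during one congestion-control reaction time."""
    bits = link_gbps * 1e9 * (loop_delay_ms * 1e-3)
    return bits / 8


# A 400 Gbps cross-DC link with a 5 ms control loop delivers
# 400e9 * 0.005 / 8 = 250 MB before the sender can slow down.
print(inflight_bytes(400, 5) / 1e6)  # 250.0 (MB)
```

Anything in that window that cannot be queued is dropped, which is exactly the loss mode Spillway targets.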
Where Pith is reading between the lines
- Similar buffering could help other distributed applications that span multiple data centers and rely on collective patterns.
- Hardware prototypes suggest the approach may scale to production environments with appropriate buffer sizing.
- By offloading recovery to the network, it could complement rather than replace end-to-end congestion control schemes.
Load-bearing premise
Switch-disaggregated buffers can be added transparently to existing hardware, hold enough packets for the bursts involved, and release them without introducing fresh contention points.
What would settle it
Observing no iteration-time reduction, or increased packet loss, in a scaled hardware test where buffer capacity is exhausted or where draining itself creates downstream congestion.
Original abstract
LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DC), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination - a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse. We present Spillway, a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers in a destination data center and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Spillway, a transparent in-network mechanism that uses switch-disaggregated buffers in the destination datacenter to store packets dropped during collisions between cross-DC collectives and intra-DC traffic. It claims that this approach eliminates performance degradation in multi-DC LLM training at tens-of-thousands-of-GPUs scale, reducing iteration time by up to 14% as shown in large-scale end-to-end simulations and a hardware prototype, without requiring changes to end hosts or training frameworks.
Significance. If the empirical results hold at the target scale, Spillway would address a practical and previously overlooked congestion collapse mode in cross-datacenter collective communication, offering a deployable mitigation that preserves existing host and framework stacks. The combination of large-scale simulation and hardware prototype is a clear strength, providing direct evidence rather than purely analytical claims. The reported 14% improvement would be meaningful for production training workloads if the buffer-capacity and transparency assumptions are shown to scale.
major comments (2)
- [Abstract and prototype section] The hardware prototype demonstrates the buffering mechanism at small scale, but the central claim that switch-disaggregated buffers remain feasible for the multi-millisecond bursts at tens-of-thousands-of-GPUs scale is not supported by any capacity analysis or scaling argument; the manuscript provides no evidence that the required buffer depth fits in existing switch hardware or that the drain path avoids new contention at the cited scale.
- [Simulation evaluation section] The claim of up to 14% iteration-time reduction rests on end-to-end simulations, yet no detailed baselines, traffic-pattern definitions, buffer-size parameters, or error bars are reported; without these, the quantitative support for the performance claim cannot be assessed, and the result remains load-bearing for the paper's contribution.
minor comments (2)
- [Introduction] Clarify in the introduction how 'switch-disaggregated buffers' are realized in commodity hardware without requiring new switch ASICs or host modifications.
- [Evaluation] Add a table or figure caption that explicitly lists the simulation parameters (link bandwidths, buffer depths, collective sizes) used to obtain the 14% figure.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of Spillway's significance and for the constructive major comments. We address each point below and will make the necessary revisions to strengthen the manuscript's claims with additional analysis and details.
Point-by-point responses
-
Referee: [Abstract and prototype section] The hardware prototype demonstrates the buffering mechanism at small scale, but the central claim that switch-disaggregated buffers remain feasible for the multi-millisecond bursts at tens-of-thousands-of-GPUs scale is not supported by any capacity analysis or scaling argument; the manuscript provides no evidence that the required buffer depth fits in existing switch hardware or that the drain path avoids new contention at the cited scale.
Authors: We thank the referee for highlighting this gap. While our large-scale simulations implicitly validate the buffer feasibility at the target scale by achieving the reported performance without buffer overflow, we agree that an explicit analysis is necessary. In the revised manuscript, we will include a new subsection on buffer capacity requirements. This will calculate the maximum burst size based on the cross-DC collective traffic patterns at 10k+ GPUs (e.g., deriving multi-ms burst volumes from the simulation parameters) and compare it against typical disaggregated buffer sizes in modern switches (such as 256-512 MB per port in high-end hardware). Additionally, we will explain that the drain path utilizes dedicated low-priority queues to avoid introducing contention with ongoing traffic. revision: yes
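The capacity analysis the authors promise can be previewed with simple arithmetic. The 256-512 MB per-port figure is taken from the rebuttal above; the arrival rate, drain rate, and burst duration below are assumptions for illustration only:

```python
def required_buffer_mb(arrival_gbps: float, drain_gbps: float,
                       burst_ms: float) -> float:
    """Peak buffer occupancy under a fluid approximation: the excess
    of arrival over drain rate, accumulated over the burst duration."""
    excess_gbps = max(arrival_gbps - drain_gbps, 0.0)
    return excess_gbps * 1e9 * (burst_ms * 1e-3) / 8 / 1e6


# Cross-DC arrivals at 400 Gbps against a congested egress draining
# only 100 Gbps, over a 5 ms burst:
print(required_buffer_mb(400, 100, 5))    # 187.5 (MB)
# A 10 ms burst already needs 375 MB, pressing against the low end
# of the 256-512 MB per-port range cited in the rebuttal.
print(required_buffer_mb(400, 100, 10))   # 375.0 (MB)
```

This is the kind of sensitivity the promised subsection would need to report: required depth grows linearly in both burst duration and the arrival/drain gap.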
-
Referee: [Simulation evaluation section] The claim of up to 14% iteration-time reduction rests on end-to-end simulations, yet no detailed baselines, traffic-pattern definitions, buffer-size parameters, or error bars are reported; without these, the quantitative support for the performance claim cannot be assessed, and the result remains load-bearing for the paper's contribution.
Authors: The referee correctly identifies that the simulation details are insufficiently documented. We will revise the evaluation section to provide: (1) precise definitions of the traffic patterns, including the specific cross-DC collective operations (e.g., all-reduce on model parameters) and their collision with intra-DC flows; (2) the baseline configurations, such as standard RDMA over TCP without Spillway; (3) the buffer sizes employed in the simulations (e.g., 100 MB per disaggregated buffer); and (4) results with error bars from at least 10 independent runs to show variability. These additions will allow readers to fully assess the 14% improvement claim. revision: yes
Circularity Check
No circularity: claims rest on direct empirical measurements from simulation and prototype
full rationale
The paper presents Spillway as an in-network buffering mechanism evaluated through large-scale end-to-end simulations and a hardware prototype. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. The central performance claim (up to 14% iteration time reduction) is reported as a measured outcome rather than a prediction derived from the mechanism's own inputs. No self-definitional loops, renamed known results, or load-bearing uniqueness theorems are present. The work is self-contained against external benchmarks via direct experimentation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Congestion-control reaction time is on the order of multiple milliseconds and therefore too slow for microsecond-scale packet bursts from collectives.
invented entities (1)
- Spillway buffering mechanism (no independent evidence)
Reference graph
Works this paper leans on
- [1] Sepehr Abdous, Erfan Sharafzadeh, and Soudeh Ghorbani. Burst-tolerant datacenter networks with Vertigo. In Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), pages 1–15, 2021.
- [2] Sepehr Abdous, Erfan Sharafzadeh, and Soudeh Ghorbani. Practical packet deflection in datacenters. Proceedings of the ACM on Networking, 1(CoNEXT3):1–25, 2023.
- [3] AI Magicx Team. Claude Mythos 5: What the First 10-Trillion-Parameter Model Actually Means for Developers, 2026. https://www.aimagicx.com/blog/claude-mythos-5-trillion-parameter-model-developer-guide-2026
- [4] Rani Borkar and Nidhi Chappell. Microsoft Azure delivers the first large scale cluster with NVIDIA GB300 NVL72 for OpenAI workloads. https://azure.microsoft.com/en-us/blog/microsoft-azure-delivers-the-first-large-scale-cluster-with-nvidia-gb300-nvl72-for-openai-workloads/
- [5] Jiamin Cao, Shangfeng Shi, Jiaqi Gao, Weisen Liu, Yifan Yang, Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian, Ying Liu, Mingwei Xu, Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai. SYCCL: Exploiting symmetry for efficient collective communication scheduling. In ACM SIGCOMM '25, pages 645–662, 2025. Association for Computing Machinery.
- [6] Cisco. Priority Flow Control: Build Reliable Layer 2 Infrastructure, 2015. https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/908/802.1q-Flow-Control-white_5F00_paper_5F00_c11_2D00_542809.pdf
- [7] Jeremie Eliahou Ontiveros, Dylan Patel, Wei Zhou, AJ Kourabi, and Maya Barkin. xAI's Colossus 2 - First Gigawatt Datacenter In The World, Unique RL Methodology, Capital Raise, 2025. http://semianalysis.com/p/xais-colossus-2-first-gigawatt-datacenter
- [8] Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. RDMA over Ethernet for distributed training at Meta scale. In ACM SIGCOMM '24, pages 57–70, 2024. Association for Computing Machinery.
- [9] Yunqi Gao, Bing Hu, Mahdi Boloursaz Mashhadi, A-Long Jin, Yanfeng Zhang, Pei Xiao, Rahim Tafazolli, and Merouane Debbah. FlowMoE: A scalable pipeline scheduling framework for distributed mixture-of-experts training, 2025.
- [10] Alexandru M Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, and Costin Raiciu. I've got 99 problems but FLOPS ain't one. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, pages 195–204, 2024.
- [11] Scott Guthrie. Infinite scale: The architecture behind the Azure AI superfactory, 2025. https://blogs.microsoft.com/blog/2025/11/12/infinite-scale-the-architecture-behind-the-azure-ai-superfactory/
- [12] Xiaoying Huang and Jingwei Wang. Inter-data center RDMA: Challenges, status, and future directions. Future Internet, 17(6), 2025.
- [13] Philipp Huber, David Li, Juan Pedro Gutiérrez Hermosillo Muriedas, Deifilia Kieckhefen, Markus Götz, Achim Streit, and Charlotte Debus. Energy consumption in parallel neural network training, 2025.
- [14] InfiniBand Trade Association. InfiniBand Architecture Specification, Volume 1, Release 1.5, 2021.
- [15] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed MoE training and inference with Lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, Boston, MA, July 2023. USENIX Association.
- [16] Tony Li, Dino Farinacci, Stanley P. Hanks, David Meyer, and Paul S. Traina. Generic Routing Encapsulation (GRE). RFC 2784, March 2000.
- [17] Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, et al. Revisiting RDMA reliability for lossy fabrics. In Proceedings of the ACM SIGCOMM 2025 Conference, pages 85–98, 2025.
- [18] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. HPCC: High precision congestion control. In ACM SIGCOMM '19, pages 44–58, 2019. Association for Computing Machinery.
- [19] Peiyuan Lin, Shuo Wang, Dong Zhou, Siyu Han, Chiliang Zhong, Yupeng Liang, and Tao Huang. Handling future congestion in cross-datacenter RDMA networks. In 2025 Thirteenth International Conference on Advanced Cloud and Big Data (CBD), pages 7–12, 2025.
- [20] Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, and Luke Marshall. Rethinking machine learning collective communication as a multi-commodity flow problem. In ACM SIGCOMM '24, pages 16–37, 2024. Association for Computing Machinery.
- [21] Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Multi-path transport for RDMA in datacenters. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 357–371, Renton, WA, April 2018. USENIX Association.
- [22] Sarah McClure, Evyatar Cohen, Alex Shpiner, Mark Silberstein, Sylvia Ratnasamy, Scott Shenker, and Isaac Keslassy. Load balancing for AI training workloads, 2026.
- [23] Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. Revisiting network support for RDMA. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 313–326, 2018.
- [24] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
- [25] Zihan Niu, Menghao Zhang, Jue Zhang, Renjie Xie, Yuan Yang, and Xiaohe Hu. Themis: Addressing congestion-induced unfairness in long-haul RDMA networks. In 2025 IEEE 33rd International Conference on Network Protocols (ICNP), pages 1–13. IEEE, 2025.
- [26] NVIDIA. NVIDIA InfiniBand Adaptive Routing Technology, 2023. https://storage.ghost.io/c/35/17/35170502-dfe4-4f36-9612-bdc657f28241/content/files/2023/12/NVIDIA_InfiniBand_Adaptive_Routing_Technology_Insights_Whitepaper.pdf
- [27] NVIDIA. NVIDIA Spectrum-X White Paper. https://resources.nvidia.com/en-us-accelerated-networking-resource-library/nvidia-spectrum-x
- [28] NVIDIA. Turbocharge LLM Training across Long-Haul Data Center Networks with NVIDIA NeMo Framework. https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/
- [29] NVIDIA. NVIDIA DGX B200, 2025. https://www.nvidia.com/en-us/data-center/dgx-b200/
- [30] NVIDIA. NVIDIA DGX SuperPOD: Reference architecture - network fabrics, 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-b200/latest/network-fabrics.html
- [31] NVIDIA. NVIDIA Spectrum-4 ASIC, 2025. https://nvdam.widen.net/s/pjlcwnrdbn/ethernet-switches-spectrum-4-asic-datasheet-us
- [32] NVIDIA Corporation. NVIDIA BlueField-3 Data Processing Unit, 2023. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf
- [33] OpenAI. Pre-Training GPT-4.5, 2025. https://www.youtube.com/watch?v=6nJZopACRuQ
- [34] Dylan Patel, Daniel Nishball, Wega Chu, Myron Xie, Ivan Chiam, Clara Ee, Cheang Kang Wen, Wei Zhou, Jeremie Eliahou Ontiveros, and Tanj Bennett. AWS Trainium3 Deep Dive - A Potential Challenger Approaching, 2025. https://newsletter.semianalysis.com/p/aws-trainium3-deep-dive-a-potential
- [35] Dylan Patel, Daniel Nishball, and Jeremie Eliahou Ontiveros. Multi-Datacenter Training: OpenAI's Ambitious Plan To Beat Google's Infrastructure. https://newsletter.semianalysis.com/p/multi-datacenter-training-openais
- [36] Jalpa Patel, Ankur Singh, and Hany Morsy. Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters, 2026. https://engineering.fb.com/2026/02/09/data-center-engineering/building-prometheus-how-backend-aggregation-enables-gigawatt-scale-ai-clusters/
- [37] Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai. Alibaba HPN: A data center network for large language model training. In ACM SIGCOMM '24, 2024. Association for Computing Machinery.
- [38] Ahmed Saeed, Varun Gupta, Prateesh Goyal, Milad Sharif, Rong Pan, Mostafa Ammar, Ellen Zegura, Keon Jang, Mohammad Alizadeh, Abdul Kabbani, et al. Annulus: A dual congestion control loop for datacenter and WAN traffic aggregates. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '20), 2020.
- [39] Adel Sefiane, Alireza Farshin, and Marios Kogias. MLSynth: Towards synthetic ML traces. In Proceedings of the 2nd Workshop on Networks for AI Computing (NAIC '25), pages 98–104, 2025. Association for Computing Machinery.
- [40] Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. TACCL: Guiding collective algorithm synthesis using communication sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 593–612, Boston, MA, April 2023. USENIX Association.
- [41] Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, et al. Collective communication for 100k+ GPUs, 2026.
- [42] Gemini Team et al. Gemini: A family of highly capable multimodal models, 2025.
- [43] Zirui Wan, Jiao Zhang, Mingxuan Yu, Junwei Liu, Jun Yao, Xinghua Zhao, and Tao Huang. BiCC: Bilateral congestion control in cross-datacenter RDMA networks. In IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, pages 1381–1390. IEEE, 2024.
- [44] Jim Warner. Switch Packet Buffers, 2019. https://people.ucsc.edu/~warner/buffer.html
- [45] William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 283–294, 2023.
- [46] Yunhong Xu, Keqiang He, Rui Wang, Minlan Yu, Nick Duffield, Hassan Wassel, Shidong Zhang, Leon Poutievski, Junlan Zhou, and Amin Vahdat. Hashing design in modern networks: Challenges and mitigation techniques. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 805–818, Carlsbad, CA, July 2022. USENIX Association.
- [47] Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, and Jitendra Padhye. DIBS: Just-in-time congestion mitigation for data centers. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys), pages 1–14, 2014.
- [48] Chiliang Zhong, Shuo Wang, Siyu Han, Zhou Dong, Peiyuan Lin, Yupeng Liang, and Tao Huang. Fine-grained feedback-driven flow control in cross-datacenter RDMA networks. In 2025 Thirteenth International Conference on Advanced Cloud and Big Data (CBD), pages 1–6. IEEE, 2025.
- [49] Anchengcheng Zhou, Carter Costic, Hongyu Hè, Ahmad Ghalayini, Abdul Kabbani, and Maria Apostolaki. Mitigating inter-datacenter incast with a proxy: The shortest path is not necessarily the fastest. In Proceedings of the 24th ACM Workshop on Hot Topics in Networks, pages 344–353, 2025.
- [50] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale RDMA deployments. ACM SIGCOMM Computer Communication Review, 45(4):523–536, 2015.