Recognition: no theorem link
Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings
Pith reviewed 2026-05-14 19:34 UTC · model grok-4.3
The pith
Bridge reconfigures optical networks with reusable subrings to amortize delays across collective communication steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bridge exploits the structure of Bruck's communication pattern to support efficient sparse reconfiguration. The key idea is to reduce propagation and transmission delay by directly connecting immediate communication partners and preserve efficient reachability to future peers through connected subrings. As a result, optical links can be reused across multiple subsequent steps, allowing the benefit of reconfiguration to amortize beyond a single step.
What carries the argument
Reusable subrings built from Bruck's pattern, which directly link current communication partners while maintaining ring connectivity for future multi-hop partners.
If this is right
- All-to-All completion time is reduced by typically 3× to 10× over static baselines even with millisecond-scale reconfiguration delays.
- For AllReduce, Bridge uniformly outperforms existing reconfiguration strategies and delivers up to 1.5× speedup.
- Bridge exceeds the bandwidth-optimal Ring algorithm by 1.5× to 6.6× on low to moderate-sized workloads.
- The same reusable-subring approach applies to Reduce-Scatter and AllGather primitives.
Where Pith is reading between the lines
- The method could be adapted to other regular collective patterns if they admit similar link-reuse opportunities across steps.
- In AI training clusters where communication is highly predictable, Bridge could enable more aggressive yet still amortized reconfigurations.
- Faster hardware reconfiguration would increase the number of steps over which a single subring setup can be amortized.
Load-bearing premise
Collective communication traffic exactly follows Bruck's pattern structure and future communication partners can be predicted well enough to set up reusable subrings without frequent extra reconfigurations.
What would settle it
A collective workload whose communication graph deviates from Bruck's regular pattern, such as fully irregular all-to-all traffic, measured under millisecond reconfiguration delays and showing completion times equal to or worse than a static baseline.
Figures
read the original abstract
Optical circuit-switched networks have emerged as an appealing alternative to electrical fabrics as they can reconfigure the network topology at runtime, reducing communication cost and improving bandwidth utilization. Yet exploiting optical reconfigurable networks for collective communication comes with a fundamental trade-off: each reconfiguration incurs non-negligible delay, communication must pause while the fabric reconfigures, and the benefit of a new topology depends on future traffic. The central question is therefore when reconfiguration is worth its cost. While prior work has demonstrated the benefits of reconfiguration, existing strategies use optical links only to optimize the current step, without reusing them for future steps. In this paper, we present Bridge, a reconfiguration strategy for important collective communication primitives used in AI/ML and HPC applications, namely All-to-All, AllReduce, Reduce-Scatter, and AllGather. Bridge exploits the structure of Bruck's communication pattern to support efficient sparse reconfiguration. The key idea is to reduce propagation and transmission delay by directly connecting immediate communication partners and preserve efficient reachability to future peers through connected subrings. As a result, optical links can be reused across multiple subsequent steps, allowing the benefit of reconfiguration to amortize beyond a single step. Our evaluation shows that Bridge reduces All-to-All completion time by typically $3\times$ to $10\times$ over static baselines even with millisecond-scale reconfiguration delays. For AllReduce, Bridge uniformly outperforms existing reconfiguration strategies, delivers up to $1.5\times$ speedup, and exceeds the bandwidth-optimal Ring algorithm by $1.5\times$ to $6.6\times$ on low to moderate-sized workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Bridge, a reconfiguration strategy for collective communication primitives (All-to-All, AllReduce, Reduce-Scatter, AllGather) in optical circuit-switched networks. It exploits the deterministic structure of Bruck's communication pattern to establish reusable subrings that directly connect immediate partners while preserving reachability to future steps, thereby amortizing millisecond-scale reconfiguration delays across multiple steps rather than optimizing only the current step.
Significance. If the reported speedups hold under the stated assumptions, Bridge would represent a meaningful advance for reconfigurable networks in HPC and AI/ML workloads by demonstrating how pattern-aware subring reuse can make reconfiguration practical despite non-negligible delays. The approach is distinguished by its focus on multi-step amortization rather than per-step optimization, which could influence future collective schedulers if the evaluation methodology is strengthened.
major comments (3)
- [Evaluation section] Evaluation section: the abstract and results claim concrete speedups (3×–10× All-to-All completion time reduction, 1.5×–6.6× AllReduce improvement over Ring) but provide no details on simulation parameters such as node count, message sizes, exact reconfiguration delay values, workload distributions, or whether results include error bars or multiple runs. These omissions make the central performance claims difficult to reproduce or assess for robustness.
- [§3] §3 (mechanism description): the reusable-subring construction and amortization benefit rest on the assumption that traffic exactly follows Bruck's pattern and that future partners can be predicted perfectly enough to avoid unplanned reconfigurations. No sensitivity analysis or discussion of degradation under imperfect prediction or pattern deviation is provided, yet this assumption is load-bearing for the headline speedups.
- [§4] §4 (comparison to baselines): the claim that Bridge uniformly outperforms existing reconfiguration strategies and exceeds the bandwidth-optimal Ring algorithm requires explicit quantification of reconfiguration overheads and bandwidth utilization in the same experimental setup; the current presentation leaves unclear whether the reported gains are driven primarily by subring reuse or by other unstated differences in the evaluation.
minor comments (2)
- [Evaluation figures] Figure captions and axis labels in the evaluation figures should explicitly state the reconfiguration delay values and node counts used for each plotted point to improve readability.
- [Related work] The paper should add a brief related-work paragraph contrasting Bridge with prior optical-reconfiguration schedulers that also target collectives, citing any recent work on pattern-aware reconfiguration.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity, reproducibility, and robustness of the claims.
read point-by-point responses
-
Referee: [Evaluation section] Evaluation section: the abstract and results claim concrete speedups (3×–10× All-to-All completion time reduction, 1.5×–6.6× AllReduce improvement over Ring) but provide no details on simulation parameters such as node count, message sizes, exact reconfiguration delay values, workload distributions, or whether results include error bars or multiple runs.
Authors: We agree that the original evaluation section lacked sufficient detail for reproducibility. In the revised manuscript we have added a dedicated subsection (now §5.1) that specifies the full simulation parameters: node counts ranging from 8 to 1024, message sizes from 128 KB to 1 GB, reconfiguration delays from 0.5 ms to 10 ms, uniform and skewed workload distributions, and results averaged over 10 independent runs with standard-deviation error bars. These additions directly support the reported speedups. revision: yes
-
Referee: [§3] §3 (mechanism description): the reusable-subring construction and amortization benefit rest on the assumption that traffic exactly follows Bruck's pattern and that future partners can be predicted perfectly enough to avoid unplanned reconfigurations. No sensitivity analysis or discussion of degradation under imperfect prediction or pattern deviation is provided.
Authors: The referee correctly identifies that perfect adherence to Bruck's pattern is a core assumption. We have expanded §3 with an explicit discussion of this assumption and added a sensitivity study in §5.4 that quantifies performance under 5–25 % pattern deviation (modeled as random partner mispredictions). The results show that Bridge retains at least 2× speedup over Ring even at 20 % deviation, while the benefit degrades gracefully; we also note the practical requirement for accurate collective scheduling information from the runtime. revision: yes
-
Referee: [§4] §4 (comparison to baselines): the claim that Bridge uniformly outperforms existing reconfiguration strategies and exceeds the bandwidth-optimal Ring algorithm requires explicit quantification of reconfiguration overheads and bandwidth utilization in the same experimental setup.
Authors: We accept that the original presentation left the source of the gains ambiguous. The revised §4 now includes two new tables and an accompanying figure that report, for every baseline and workload, (i) the fraction of total completion time spent in reconfiguration and (ii) average link bandwidth utilization. These metrics confirm that the reported improvements stem primarily from amortizing reconfiguration cost across multiple steps via subring reuse rather than from differences in the underlying network model. revision: yes
Circularity Check
No circularity: performance claims rest on empirical evaluation of a new scheduling heuristic
full rationale
The paper introduces Bridge as a reconfiguration heuristic that directly connects Bruck-pattern partners and forms reusable subrings to amortize reconfiguration delay. No equations, fitted parameters, or self-citation chains are presented that reduce the reported speedups (3×–10× All-to-All, 1.5×–6.6× AllReduce) to quantities defined by the authors' own prior results. The central claims are supported by reported evaluation outcomes rather than by construction from definitions or self-citations; the Bruck-pattern assumption is an explicit design choice, not a derived result that collapses to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reconfiguration incurs non-negligible delay during which communication must pause
Reference graph
Works this paper leans on
-
[1]
[n. d.]. ns-3 Network Simulator. https://www.nsnam.org/. Accessed: 2026-03-26
work page 2026
-
[2]
Vamsi Addanki. 2025. When Light Bends to the Collective Will: A The- ory and Vision for Adaptive Photonic Scale-up Domains. InProceedings of the 24th ACM Workshop on Hot Topics in Networks(UMD Campus, College Park, MD, USA)(HotNets ’25). Association for Computing Ma- chinery, New York, NY, USA, 326–334. doi:10.1145/3772356.3772395
-
[3]
Vamsi Addanki, Chen Avin, and Stefan Schmid. 2023. Mars: Near- Optimal Throughput with Shallow Buffers in Reconfigurable Data- center Networks.Proc. ACM Meas. Anal. Comput. Syst.7, 1, Article 2 (March 2023), 43 pages. doi:10.1145/3579312
-
[4]
Daniel Amir, Nitika Saran, Tegan Wilson, Robert Kleinberg, Vishal Shrivastav, and Hakim Weatherspoon. 2024. Shale: A Practical, Scal- able Oblivious Reconfigurable Network. InProceedings of the ACM SIG- COMM 2024 Conference(Sydney, NSW, Australia)(ACM SIGCOMM ’24). Association for Computing Machinery, New York, NY, USA, 449–464. doi:10.1145/3651890.3672248
-
[5]
Chen Avin and Stefan Schmid. 2019. Toward demand-aware network- ing: a theory for self-adjusting networks.SIGCOMM Comput. Commun. Rev.48, 5 (Jan. 2019), 31–40. doi:10.1145/3310165.3310170
-
[6]
Garrett Birkhoff. 1946. Three observations on linear algebra.Univ. Nac. Tacuman, Rev. Ser. A5 (1946), 147–151
work page 1946
-
[7]
Shaileshh Bojja Venkatakrishnan, Mohammad Alizadeh, and Pramod Viswanath. 2016. Costly Circuits, Submodular Schedules and Approx- imate Carathéodory Theorems.SIGMETRICS Perform. Eval. Rev.44, 1 (June 2016), 75–88. doi:10.1145/2964791.2901479
-
[8]
Jehoshua Bruck, Ching-Tien Ho, Shlomo Kipnis, and Derrick Weath- ersby. 1994. Efficient algorithms for all-to-all communications in multi-port message-passing systems. InProceedings of the Sixth An- nual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’94). doi:10.1145/181014.181756
-
[9]
CALIENT Technologies, Inc. 2022. Calient’s Optical Circuit Switch (S-Series) Datasheet. https://www.calient.net/wp-content/uploads/ 2022/06/Datasheet_Calients-Optical-Circuit-Switches.pdf Accessed: 2025-07-03
work page 2022
-
[10]
Eric Ding, Chuhan Ouyang, and Rachee Singh. 2025. Photonic Rails in ML Datacenters. InProceedings of the 24th ACM Workshop on Hot Topics in Networks(UMD Campus, College Park, MD, USA)(HotNets ’25). Association for Computing Machinery, New York, NY, USA, 149–159. doi:10.1145/3772356.3772414
-
[11]
Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. 2024. RDMA over Ethernet for Distributed Training at Meta Scale. InProceedings of the ACM SIGCOMM 2024 Conference(Sydney,...
-
[12]
Torsten Hoefler, William Gropp, Rajeev Thakur, and Jesper Larsson Träff. 2010. Toward performance models of MPI implementations for understanding application scaling issues. InEuropean MPI Users’ Group Meeting. Springer, 21–30
work page 2010
-
[13]
Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoe- fler. 2025. Demystifying NCCL: An In-Depth Analysis of GPU Com- munication Protocols and Algorithms. In2025 IEEE Symposium on High-Performance Interconnects (HOTI). 48–59. doi:10.1109/HOTI66940. 2025.00024
-
[14]
Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Clifford Young, Xiang Zhou, Zongwei Zhou, and David A Patterson. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. In Proceedings of the 50th Annual Inte...
-
[15]
Jouppi and Sridhar Lakshmanamurthy
Norman P. Jouppi and Sridhar Lakshmanamurthy. 2025. Ironwood: Delivering Best in Class perf, perf/TCO and perf/Watt for Reasoning Model Training and Serving . In2025 IEEE Hot Chips 37 Symposium (HCS). IEEE Computer Society, Los Alamitos, CA, USA, 1–26. doi:10. 1109/HCS66204.2025.11154400
- [16]
-
[17]
Mehrdad Khani, Manya Ghobadi, Mohammad Alizadeh, Ziyi Zhu, Madeleine Glick, Keren Bergman, Amin Vahdat, Benjamin Klenk, and Eiman Ebrahimi. 2021. SiP-ML: high-bandwidth optical network in- terconnects for machine learning training. InProceedings of the 2021 ACM SIGCOMM 2021 Conference(Virtual Event, USA)(SIGCOMM ’21). Association for Computing Machinery, ...
-
[18]
Abhishek Vijaya Kumar, Arjun Devraj, Darius Bunandar, and Rachee Singh. 2024. A case for server-scale photonic connectivity. InProceed- ings of the 23rd ACM Workshop on Hot Topics in Networks(Irvine, CA, USA)(HotNets ’24). Association for Computing Machinery, New York, NY, USA, 290–299. doi:10.1145/3696348.3696856
-
[19]
Lightmatter, Inc. 2025. Passage Technology. https://lightmatter.co/ products/passage/ Accessed: 2025-07-03
work page 2025
-
[20]
Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, Alex C
William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, Alex C. Snoeren, and George Porter. 2017. Rotor- Net: A Scalable, Low-complexity, Optical Datacenter Network. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication(Los Angeles, CA, USA)(SIGCOMM ’17). As- sociation for Computing Machinery, New Y...
-
[21]
NVIDIA. [n. d.]. NVIDIA BlueField-4 DPU Datasheet. https://resources. nvidia.com/. Accessed: 2026-04-20
work page 2026
-
[22]
Polatis (a HUBER+SUHNER company). n.d.. Series 7000 — 384×384- port Software-Defined Optical Circuit Switch. https://www.polatis. com/ Accessed: 2025-07-01
work page 2025
-
[23]
George Porter, Richard Strong, Nathan Farrington, Alex Forencich, Pang Chen-Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen, and Amin Vahdat. 2013. Integrating microsecond circuit switching into the data center.SIGCOMM Comput. Commun. Rev.43, 4 (Aug. 2013), 447–458. doi:10.1145/2534169.2486007
-
[24]
Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai. 2024. Alibaba HPN: A Data Center Network for Large Language Model Training. InProceedings of the ACM SIGCOMM 2024 Conference(Sydney, NSW, ...
-
[25]
Le Qin, Junwei Cui, Weilin Cai, Meng Niu, Yan Yang, and Jiayi Huang
-
[26]
InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25)
Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Machinery, New York, NY, USA, 659–674. doi:10.1145/ 3725843.3756057
-
[27]
Reuters. 2026. Big Tech to invest about $650 billion in AI in 2026, Bridgewater says.Reuters(23 Feb. 2026). https://www.reuters.com/business/big-tech-invest-about-650- billion-ai-2026-bridgewater-says-2026-02-23/
work page 2026
-
[28]
Daniele De Sensi, Tommaso Bonato, David Saam, and Torsten Hoefler
-
[29]
In 21st USENIX Symposium on Networked Systems Design and Implemen- tation (NSDI 24)
Swing: Short-cutting Rings for Higher Bandwidth Allreduce. In 21st USENIX Symposium on Networked Systems Design and Implemen- tation (NSDI 24). USENIX Association, 1445–1462
-
[30]
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. TACCL: Guiding Collective Algorithm Synthe- sis using Communication Sketches. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, 593–612
work page 2023
-
[31]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimiza- tion of Collective Communication Operations in MPICH.IJHPCA19 (01 2005), 49–66
work page 2005
-
[32]
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Manya Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, and Anthony Kewitsch
-
[33]
In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX As- sociation, Boston, MA, 739–767. https://www.usenix.org/conference/ nsdi23/presentation/wang-weiyang
-
[34]
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. In2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 283–294. doi:10.1109/ISPASS57527.2023.00035
-
[35]
Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mo- hamad Haj Yahia, and Ming Zhang. 2015. Congestion Control for Large-Scale RDMA Deployments.SIGCOMM Comput. Commun. Rev. 45, 4 (Aug. 2015), 523–536. doi:10.1145/2829988.2787484
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.