Multi-Plane HyperX: A Low-Latency and Cost-Effective Network for Large-Scale AI and HPC Systems
Pith reviewed 2026-05-08 05:17 UTC · model grok-4.3
The pith
Multi-plane HyperX achieves a smaller network diameter and better cost-effectiveness than multi-plane Fat-Tree, Dragonfly, and Dragonfly+ for large-scale AI and HPC systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper investigates the multi-plane HyperX network and demonstrates that, compared to state-of-the-art network topologies like multi-plane Fat-Tree, Dragonfly, and Dragonfly+, the multi-plane HyperX architecture achieves a significantly smaller network diameter and superior cost-effectiveness.
What carries the argument
Multi-plane HyperX architecture: allocating multiple NIC ports or NICs to independent planes in a HyperX topology to enable shorter paths and efficient scaling.
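The plane structure can be made concrete with a small sizing sketch. This is a hedged illustration, not the paper's model: it assumes a regular HyperX in which every dimension is fully connected (so the worst-case switch-to-switch distance is one hop per dimension), and the function names and parameter values are invented for the example.

```python
def hyperx_plane(dims, terminals_per_switch):
    """Size one HyperX plane.

    dims: per-dimension switch counts, e.g. [8, 8] for an 8x8 plane.
    terminals_per_switch: nodes attached to each switch in this plane.
    """
    switches = 1
    for s in dims:
        switches *= s
    nodes = switches * terminals_per_switch
    # One hop per dimension: each dimension forms a clique of switches.
    diameter = len(dims)
    # Radix needed: terminal ports plus (S_k - 1) peer links per dimension.
    radix = terminals_per_switch + sum(s - 1 for s in dims)
    return {"switches": switches, "nodes": nodes,
            "diameter": diameter, "radix": radix}

def multi_plane(plane, planes):
    """Replicate one plane into independent planes. Each node spends one
    NIC port per plane, so the node count is unchanged while switch count
    (and path diversity) scales with the plane count; routing stays
    within a single plane, so the diameter does not change."""
    return {"planes": planes,
            "switches": plane["switches"] * planes,
            "nodes": plane["nodes"],
            "diameter": plane["diameter"]}

p = hyperx_plane([8, 8], 8)   # 64 switches, 512 nodes, diameter 2
print(multi_plane(p, 4))      # 4 planes: 256 switches, still diameter 2
```

The key property the sketch illustrates: adding planes buys bandwidth and redundancy without lengthening paths, because packets never cross planes.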
If this is right
- Large-scale AI training jobs experience reduced communication latency from the smaller diameter.
- Total cost of ownership decreases for clusters of equivalent scale and performance.
- Network designers gain a new option for balancing latency and expense in direct networks.
- Scalability improves, since more nodes can be added without the diameter growing as quickly.
Where Pith is reading between the lines
- Hardware vendors could prioritize NICs with more ports to support such multi-plane designs.
- Routing protocols might need adaptation to exploit the plane structure for even better performance.
- Similar multi-planing could be applied to other direct networks to check for comparable benefits.
Load-bearing premise
That the multi-plane approach transfers to HyperX without adding unforeseen routing overheads or requiring non-comparable cost models.
What would settle it
A side-by-side benchmark of packet latencies and total hardware costs in simulated or built 10,000-node instances of each topology under AI-like traffic patterns.
Figures
original abstract
Multi-plane architectures have become increasingly prevalent in the Fat-Tree networks of AI data centers. By leveraging multiple ports on a single network interface card (NIC) or multiple NICs within a scale-up domain, each port or NIC is allocated to an independent network plane, thereby provisioning the overall system with multiple network planes. However, no prior literature has explored the application of multi-plane technologies to direct networks such as HyperX. This paper investigates the multi-plane HyperX network and demonstrates that, compared to state-of-the-art network topologies like multi-plane Fat-Tree, Dragonfly, and Dragonfly+, the multi-plane HyperX architecture achieves a significantly smaller network diameter and superior cost-effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes extending multi-plane technology—previously applied mainly to Fat-Tree networks—to the direct HyperX topology for large-scale AI and HPC systems. It claims that the resulting multi-plane HyperX achieves a significantly smaller network diameter and superior cost-effectiveness compared to multi-plane Fat-Tree, Dragonfly, and Dragonfly+ under equivalent configurations.
Significance. If the diameter and cost comparisons rest on identical assumptions for plane count, NIC allocation, switch radix, and link parameters across topologies, the work would offer a useful new design point for low-latency interconnects in AI data centers. The novelty of applying multi-plane concepts to direct networks is a clear strength.
major comments (2)
- [Diameter Analysis] Diameter section: the claim of significantly smaller diameter requires explicit formulas (or tabulated values) for multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+ that use the exact same number of planes, NIC ports per node, and inter-plane routing assumptions. It is unclear whether the direct-network case introduces extra intra-plane hops or serialization that are absent from the indirect baselines.
- [Cost Model] Cost Model section: the cost-effectiveness comparison must demonstrate that switch, link, and NIC counts are computed with identical per-plane hardware parameters (radix, port allocation, wiring overhead) for all topologies. Any topology-specific modeling choice in the direct-network case would render the superiority claim non-comparable.
minor comments (2)
- [Abstract] Abstract: adding one or two concrete numbers (e.g., diameter reduction factor or relative cost at a given scale) would strengthen the summary of results.
- [Introduction] Introduction: include additional citations to deployed multi-plane Fat-Tree systems in AI clusters to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the novelty of extending multi-plane concepts to direct networks. We address each major comment below and will revise the manuscript accordingly to strengthen the clarity and comparability of our diameter and cost analyses.
point-by-point responses
Referee: [Diameter Analysis] Diameter section: the claim of significantly smaller diameter requires explicit formulas (or tabulated values) for multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+ that use the exact same number of planes, NIC ports per node, and inter-plane routing assumptions. It is unclear whether the direct-network case introduces extra intra-plane hops or serialization that are absent from the indirect baselines.
Authors: We agree that explicit formulas and tabulated values are necessary to make the diameter claims fully transparent and comparable. In the revised manuscript, we will add a new subsection in the Diameter Analysis section that provides closed-form expressions for the diameter of each topology (multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+) under identical parameters: the same number of planes, the same number of NIC ports per node allocated to planes, the same switch radix, and the same inter-plane routing model. For multi-plane HyperX, the diameter equals the single-plane HyperX diameter because each plane is an independent direct network and packets are routed entirely within one plane; no additional intra-plane hops or serialization stages are introduced beyond the standard HyperX routing. We will also include a comparison table with numerical diameter values for representative system sizes (e.g., 4K to 64K nodes) to illustrate the advantage. These additions will directly address the concern about hidden differences between direct and indirect cases. revision: yes
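The identical-parameter table promised in this response can be sketched as below. The switch-hop diameters used here are commonly cited textbook values assumed for illustration, not values taken from the paper: an L-dimensional HyperX has diameter L (one hop per fully connected dimension), a 3-level Fat-Tree has 4 inter-switch hops, minimally routed Dragonfly has 3, and Dragonfly+ has 4.

```python
def diameter(topology, dims=None, levels=None):
    """Switch-to-switch hop diameter under the textbook assumptions above."""
    if topology == "hyperx":
        return dims               # one hop per fully connected dimension
    if topology == "fat-tree":
        return 2 * (levels - 1)   # up to the top level and back down
    if topology == "dragonfly":
        return 3                  # local -> global -> local
    if topology == "dragonfly+":
        return 4                  # leaf -> spine -> global -> spine -> leaf
    raise ValueError(topology)

rows = [
    ("multi-plane HyperX (2-D)",        diameter("hyperx", dims=2)),
    ("multi-plane HyperX (3-D)",        diameter("hyperx", dims=3)),
    ("multi-plane Fat-Tree (3 levels)", diameter("fat-tree", levels=3)),
    ("Dragonfly",                       diameter("dragonfly")),
    ("Dragonfly+",                      diameter("dragonfly+")),
]
for name, d in rows:
    print(f"{name:34s} {d} switch-to-switch hops")
```

Under these assumptions a 2-D or 3-D HyperX plane undercuts the indirect baselines, which is the shape of the claim the revised table would need to substantiate with the paper's own parameters.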
Referee: [Cost Model] Cost Model section: the cost-effectiveness comparison must demonstrate that switch, link, and NIC counts are computed with identical per-plane hardware parameters (radix, port allocation, wiring overhead) for all topologies. Any topology-specific modeling choice in the direct-network case would render the superiority claim non-comparable.
Authors: We concur that all cost comparisons must rest on identical per-plane hardware assumptions. The revised Cost Model section will explicitly list and apply the same parameters for every topology: identical switch radix, identical port allocation per plane, identical link bandwidth and length assumptions, and identical wiring-overhead factors. We will provide the exact counting formulas for the number of switches, links, and NICs in multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+, ensuring that the direct-network case uses the same per-plane modeling choices as the indirect baselines. A supplementary table will tabulate the resulting component counts and total cost for equivalent system scales. This revision will eliminate any ambiguity and substantiate the cost-effectiveness claims on a fully comparable basis. revision: yes
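The kind of identical-parameter component counting this response commits to can be sketched as follows. The HyperX clique counting and the standard k-ary 3-level Fat-Tree formulas are textbook assumptions, the prices are arbitrary placeholders, and the two example scales deliberately differ; nothing here is taken from the paper's actual cost model, and a real comparison would match node counts across topologies.

```python
from math import comb, prod

def hyperx_plane_counts(dims, terminals):
    switches = prod(dims)
    # Each dimension k partitions the switches into switches/S_k disjoint
    # cliques of S_k switches, each contributing C(S_k, 2) links.
    sw_links = sum((switches // s) * comb(s, 2) for s in dims)
    return {"switches": switches, "switch_links": sw_links,
            "nodes": switches * terminals}

def fat_tree_counts(k):
    # Standard k-ary 3-level Fat-Tree with k-port switches:
    # k^2/2 edge + k^2/2 aggregation + k^2/4 core switches.
    return {"switches": 5 * k * k // 4,
            "switch_links": k ** 3 // 2,   # edge-agg plus agg-core
            "nodes": k ** 3 // 4}

def multi_plane_cost(per_plane, planes, switch_price=1.0,
                     link_price=0.1, nic_port_price=0.2):
    # Every node contributes one NIC port per plane; switches and links
    # replicate with the plane count. Prices are placeholders only.
    return planes * (per_plane["switches"] * switch_price
                     + per_plane["switch_links"] * link_price
                     + per_plane["nodes"] * nic_port_price)

hx = hyperx_plane_counts([16, 16], 16)  # 256 switches, 4096 nodes
ft = fat_tree_counts(32)                # 1280 switches, 8192 nodes
print(hx, ft)
print(multi_plane_cost(hx, planes=4), multi_plane_cost(ft, planes=4))
```

The point of the sketch is methodological: once the counting functions share the same radix, plane count, and price vector, any remaining cost difference is attributable to topology alone, which is what the revised section needs to show.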
Circularity Check
No circularity detected; claims rest on independent topology formulas
full rationale
The paper introduces multi-plane HyperX as a novel extension of direct networks and asserts smaller diameter plus lower cost versus multi-plane Fat-Tree, Dragonfly, and Dragonfly+ using standard diameter and cost models. No equations or sections reduce a derived quantity to a fitted parameter or self-citation by construction; the abstract explicitly notes the absence of prior multi-plane work on HyperX, and comparisons invoke external topology properties rather than internal definitions. The derivation chain remains self-contained against external benchmarks with no load-bearing self-referential steps.
Reference graph
Works this paper leans on
- [1] Jung Ho Ahn, Nathan Binkert, Al Davis, Moray McLaren, and Robert S. Schreiber. 2009. HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). Association for Computing Machinery, New York, NY, USA, Article 41, 11 pages. doi:10.1145/1654059.1654101
- [2] Dezun Dong, Ziyu Wang, and Fei Lei. 2025. Zettafly: A Network Topology with Flexible Non-blocking Regions for Large-scale AI and HPC Systems. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25). Association for Computing Machinery, New York, NY, USA, 835–848. doi:10.1145/3695053.3731098
- [3] John Kim, William J. Dally, and Dennis Abts. 2007. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA '07). Association for Computing Machinery, 126–137.
- [4] John Kim, William J. Dally, Steve Scott, and Dennis Abts. 2008. Technology-Driven, Highly-Scalable Dragonfly Topology. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA '08). IEEE Computer Society, USA, 77–88. doi:10.1109/ISCA.2008.19
- [5] Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, et al. 2024. Alibaba HPN: A Data Center Network for Large Language Model Training. In Proceedings of the ACM SIGCOMM 2024 Conference. 691–706.
- [6] Alexander Shpiner, Zachy Haramaty, Saar Eliad, Vladimir Zdornov, Barak Gafni, and Eitan Zahavi. 2017. Dragonfly+: Low Cost Topology for Scaling Datacenters. In 2017 IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB). 1–8. doi:10.1109/HiPINEB.2017.11
- [7] Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, and Naader Hasani. 2024. Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters. In 2024 IEEE Symposium on High-Performance Interconnects (HOTI). 1–10. doi:10.1109/HOTI63208.2024.00013
- [8] Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, et al. 2025. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1731–1745.