Multi-Plane HyperX: A Low-Latency and Cost-Effective Network for Large-Scale AI and HPC Systems
Pith reviewed 2026-05-08 05:17 UTC · model grok-4.3
The pith
Multi-plane HyperX achieves a smaller network diameter and better cost-effectiveness than multi-plane Fat-Tree, Dragonfly, and Dragonfly+ for large-scale AI and HPC systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper investigates the multi-plane HyperX network and demonstrates that, compared to state-of-the-art network topologies like multi-plane Fat-Tree, Dragonfly, and Dragonfly+, the multi-plane HyperX architecture achieves a significantly smaller network diameter and superior cost-effectiveness.
What carries the argument
Multi-plane HyperX architecture: allocating multiple NIC ports or NICs to independent planes in a HyperX topology to enable shorter paths and efficient scaling.
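The plane structure can be made concrete with a small sizing sketch. This is a hedged illustration, not the paper's model: it assumes a regular HyperX in which every dimension is fully connected (so the worst-case switch-to-switch distance is one hop per dimension), and the function names and parameter values are invented for the example.

```python
def hyperx_plane(dims, terminals_per_switch):
    """Size one HyperX plane.

    dims: per-dimension switch counts, e.g. [8, 8] for an 8x8 plane.
    terminals_per_switch: nodes attached to each switch in this plane.
    """
    switches = 1
    for s in dims:
        switches *= s
    nodes = switches * terminals_per_switch
    # One hop per dimension: each dimension forms a clique of switches.
    diameter = len(dims)
    # Radix needed: terminal ports plus (S_k - 1) peer links per dimension.
    radix = terminals_per_switch + sum(s - 1 for s in dims)
    return {"switches": switches, "nodes": nodes,
            "diameter": diameter, "radix": radix}

def multi_plane(plane, planes):
    """Replicate one plane into independent planes. Each node spends one
    NIC port per plane, so the node count is unchanged while switch count
    (and path diversity) scales with the plane count; routing stays
    within a single plane, so the diameter does not change."""
    return {"planes": planes,
            "switches": plane["switches"] * planes,
            "nodes": plane["nodes"],
            "diameter": plane["diameter"]}

p = hyperx_plane([8, 8], 8)   # 64 switches, 512 nodes, diameter 2
print(multi_plane(p, 4))      # 4 planes: 256 switches, still diameter 2
```

The key property the sketch illustrates: adding planes buys bandwidth and redundancy without lengthening paths, because packets never cross planes.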
If this is right
- Large-scale AI training jobs experience reduced communication latency from the smaller diameter.
- Total cost of ownership decreases for clusters of equivalent scale and performance.
- Network designers gain a new option for balancing latency and expense in direct networks.
- Scalability improves, since more nodes can be added without the diameter growing as quickly.
Where Pith is reading between the lines
- Hardware vendors could prioritize NICs with more ports to support such multi-plane designs.
- Routing protocols might need adaptation to exploit the plane structure for even better performance.
- Similar multi-planing could be applied to other direct networks to check for comparable benefits.
Load-bearing premise
That the multi-plane approach transfers to HyperX without adding unforeseen routing overheads or requiring non-comparable cost models.
What would settle it
A side-by-side benchmark of packet latencies and total hardware costs in simulated or built 10,000-node instances of each topology under AI-like traffic patterns.
Figures
original abstract
Multi-plane architectures have become increasingly prevalent in the Fat-Tree networks of AI data centers. By leveraging multiple ports on a single network interface card (NIC) or multiple NICs within a scale-up domain, each port or NIC is allocated to an independent network plane, thereby provisioning the overall system with multiple network planes. However, no prior literature has explored the application of multi-plane technologies to direct networks such as HyperX. This paper investigates the multi-plane HyperX network and demonstrates that, compared to state-of-the-art network topologies like multi-plane Fat-Tree, Dragonfly, and Dragonfly+, the multi-plane HyperX architecture achieves a significantly smaller network diameter and superior cost-effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes extending multi-plane technology—previously applied mainly to Fat-Tree networks—to the direct HyperX topology for large-scale AI and HPC systems. It claims that the resulting multi-plane HyperX achieves a significantly smaller network diameter and superior cost-effectiveness compared to multi-plane Fat-Tree, Dragonfly, and Dragonfly+ under equivalent configurations.
Significance. If the diameter and cost comparisons rest on identical assumptions for plane count, NIC allocation, switch radix, and link parameters across topologies, the work would offer a useful new design point for low-latency interconnects in AI data centers. The novelty of applying multi-plane concepts to direct networks is a clear strength.
major comments (2)
- [Diameter Analysis] Diameter section: the claim of significantly smaller diameter requires explicit formulas (or tabulated values) for multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+ that use the exact same number of planes, NIC ports per node, and inter-plane routing assumptions. It is unclear whether the direct-network case introduces extra intra-plane hops or serialization that are absent from the indirect baselines.
- [Cost Model] Cost Model section: the cost-effectiveness comparison must demonstrate that switch, link, and NIC counts are computed with identical per-plane hardware parameters (radix, port allocation, wiring overhead) for all topologies. Any topology-specific modeling choice in the direct-network case would render the superiority claim non-comparable.
minor comments (2)
- [Abstract] Abstract: adding one or two concrete numbers (e.g., diameter reduction factor or relative cost at a given scale) would strengthen the summary of results.
- [Introduction] Introduction: include additional citations to deployed multi-plane Fat-Tree systems in AI clusters to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the novelty of extending multi-plane concepts to direct networks. We address each major comment below and will revise the manuscript accordingly to strengthen the clarity and comparability of our diameter and cost analyses.
point-by-point responses
Referee: [Diameter Analysis] Diameter section: the claim of significantly smaller diameter requires explicit formulas (or tabulated values) for multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+ that use the exact same number of planes, NIC ports per node, and inter-plane routing assumptions. It is unclear whether the direct-network case introduces extra intra-plane hops or serialization that are absent from the indirect baselines.
Authors: We agree that explicit formulas and tabulated values are necessary to make the diameter claims fully transparent and comparable. In the revised manuscript, we will add a new subsection in the Diameter Analysis section that provides closed-form expressions for the diameter of each topology (multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+) under identical parameters: the same number of planes, the same number of NIC ports per node allocated to planes, the same switch radix, and the same inter-plane routing model. For multi-plane HyperX, the diameter equals the single-plane HyperX diameter because each plane is an independent direct network and packets are routed entirely within one plane; no additional intra-plane hops or serialization stages are introduced beyond the standard HyperX routing. We will also include a comparison table with numerical diameter values for representative system sizes (e.g., 4K to 64K nodes) to illustrate the advantage. These additions will directly address the concern about hidden differences between direct and indirect cases. revision: yes
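The identical-parameter table promised in this response can be sketched as below. The switch-hop diameters used here are commonly cited textbook values assumed for illustration, not values taken from the paper: an L-dimensional HyperX has diameter L (one hop per fully connected dimension), a 3-level Fat-Tree has 4 inter-switch hops, minimally routed Dragonfly has 3, and Dragonfly+ has 4.

```python
def diameter(topology, dims=None, levels=None):
    """Switch-to-switch hop diameter under the textbook assumptions above."""
    if topology == "hyperx":
        return dims               # one hop per fully connected dimension
    if topology == "fat-tree":
        return 2 * (levels - 1)   # up to the top level and back down
    if topology == "dragonfly":
        return 3                  # local -> global -> local
    if topology == "dragonfly+":
        return 4                  # leaf -> spine -> global -> spine -> leaf
    raise ValueError(topology)

rows = [
    ("multi-plane HyperX (2-D)",        diameter("hyperx", dims=2)),
    ("multi-plane HyperX (3-D)",        diameter("hyperx", dims=3)),
    ("multi-plane Fat-Tree (3 levels)", diameter("fat-tree", levels=3)),
    ("Dragonfly",                       diameter("dragonfly")),
    ("Dragonfly+",                      diameter("dragonfly+")),
]
for name, d in rows:
    print(f"{name:34s} {d} switch-to-switch hops")
```

Under these assumptions a 2-D or 3-D HyperX plane undercuts the indirect baselines, which is the shape of the claim the revised table would need to substantiate with the paper's own parameters.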
Referee: [Cost Model] Cost Model section: the cost-effectiveness comparison must demonstrate that switch, link, and NIC counts are computed with identical per-plane hardware parameters (radix, port allocation, wiring overhead) for all topologies. Any topology-specific modeling choice in the direct-network case would render the superiority claim non-comparable.
Authors: We concur that all cost comparisons must rest on identical per-plane hardware assumptions. The revised Cost Model section will explicitly list and apply the same parameters for every topology: identical switch radix, identical port allocation per plane, identical link bandwidth and length assumptions, and identical wiring-overhead factors. We will provide the exact counting formulas for the number of switches, links, and NICs in multi-plane HyperX, multi-plane Fat-Tree, Dragonfly, and Dragonfly+, ensuring that the direct-network case uses the same per-plane modeling choices as the indirect baselines. A supplementary table will tabulate the resulting component counts and total cost for equivalent system scales. This revision will eliminate any ambiguity and substantiate the cost-effectiveness claims on a fully comparable basis. revision: yes
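The kind of identical-parameter component counting this response commits to can be sketched as follows. The HyperX clique counting and the standard k-ary 3-level Fat-Tree formulas are textbook assumptions, the prices are arbitrary placeholders, and the two example scales deliberately differ; nothing here is taken from the paper's actual cost model, and a real comparison would match node counts across topologies.

```python
from math import comb, prod

def hyperx_plane_counts(dims, terminals):
    switches = prod(dims)
    # Each dimension k partitions the switches into switches/S_k disjoint
    # cliques of S_k switches, each contributing C(S_k, 2) links.
    sw_links = sum((switches // s) * comb(s, 2) for s in dims)
    return {"switches": switches, "switch_links": sw_links,
            "nodes": switches * terminals}

def fat_tree_counts(k):
    # Standard k-ary 3-level Fat-Tree with k-port switches:
    # k^2/2 edge + k^2/2 aggregation + k^2/4 core switches.
    return {"switches": 5 * k * k // 4,
            "switch_links": k ** 3 // 2,   # edge-agg plus agg-core
            "nodes": k ** 3 // 4}

def multi_plane_cost(per_plane, planes, switch_price=1.0,
                     link_price=0.1, nic_port_price=0.2):
    # Every node contributes one NIC port per plane; switches and links
    # replicate with the plane count. Prices are placeholders only.
    return planes * (per_plane["switches"] * switch_price
                     + per_plane["switch_links"] * link_price
                     + per_plane["nodes"] * nic_port_price)

hx = hyperx_plane_counts([16, 16], 16)  # 256 switches, 4096 nodes
ft = fat_tree_counts(32)                # 1280 switches, 8192 nodes
print(hx, ft)
print(multi_plane_cost(hx, planes=4), multi_plane_cost(ft, planes=4))
```

The point of the sketch is methodological: once the counting functions share the same radix, plane count, and price vector, any remaining cost difference is attributable to topology alone, which is what the revised section needs to show.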
Circularity Check
No circularity detected; claims rest on independent topology formulas
full rationale
The paper introduces multi-plane HyperX as a novel extension of direct networks and asserts smaller diameter plus lower cost versus multi-plane Fat-Tree, Dragonfly, and Dragonfly+ using standard diameter and cost models. No equations or sections reduce a derived quantity to a fitted parameter or self-citation by construction; the abstract explicitly notes the absence of prior multi-plane work on HyperX, and comparisons invoke external topology properties rather than internal definitions. The derivation chain remains self-contained against external benchmarks with no load-bearing self-referential steps.
Reference graph
Works this paper leans on
- [1] Jung Ho Ahn, Nathan Binkert, Al Davis, Moray McLaren, and Robert S. Schreiber. 2009. HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). Association for Computing Machinery, New York, NY, USA, Article 41, 11 pages. doi:10.1145/1654059.1654101
- [2] Dezun Dong, Ziyu Wang, and Fei Lei. 2025. Zettafly: A Network Topology with Flexible Non-blocking Regions for Large-scale AI and HPC Systems. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25). Association for Computing Machinery, New York, NY, USA, 835–848. doi:10.1145/3695053.3731098
- [3] John Kim, William J. Dally, and Dennis Abts. 2007. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA '07). Association for Computing Machinery, 126–137.
- [4] John Kim, William J. Dally, Steve Scott, and Dennis Abts. 2008. Technology-Driven, Highly-Scalable Dragonfly Topology. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA '08). IEEE Computer Society, USA, 77–88. doi:10.1109/ISCA.2008.19
- [5] Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, et al. 2024. Alibaba HPN: A Data Center Network for Large Language Model Training. In Proceedings of the ACM SIGCOMM 2024 Conference. 691–706.
- [6] Alexander Shpiner, Zachy Haramaty, Saar Eliad, Vladimir Zdornov, Barak Gafni, and Eitan Zahavi. 2017. Dragonfly+: Low Cost Topology for Scaling Datacenters. In 2017 IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB). 1–8. doi:10.1109/HiPINEB.2017.11
- [7] Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, and Naader Hasani. 2024. Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters. In 2024 IEEE Symposium on High-Performance Interconnects (HOTI). 1–10. doi:10.1109/HOTI63208.2024.00013
- [8] Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, et al. 2025. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1731–1745.