pith. sign in

arxiv: 2605.27963 · v1 · pith:2LTIJTVCnew · submitted 2026-05-27 · 💻 cs.NI · cs.DC

Throughput-Optimized Networks at Scale

classification 💻 cs.NI cs.DC
keywords throughputnetworksscaleproblemtonsall-to-allnetworkrouting
0
0 comments X
read the original abstract

Datacenter network design plays a critical role in AI training by supporting scaling to thousands of accelerators. An open problem, designing a near-optimal throughput oriented network-topology, routing, and collectives-has not been achieved at scale and with broad applicability to physical/implementation constraints. We address this problem with a compelling use-case, Google's TPU v4/5p supercomputer where the topology may be reconfigured to achieve higher all-to-all throughput, supporting large, parallelized AI training. We show that the existing TPU networks leave terabytes per second of throughput on the table and we fill that gap. This paper presents Throughput Optimized Networks at Scale (TONS), an automated network synthesis framework that meets the high-throughput demands of modern computing. TONS formulates topology synthesis as a linear optimization problem that maximizes a throughput-centric proxy metric, using theory and heuristics to scale to thousands of nodes. We further introduce a deadlock-free routing scheme compatible with limited virtual channels and optical switch faults, enabling the synthesized topologies to realize their predicted throughput gains in simulation. Evaluating uniform random and all-to-all traffic, TONS networks have a geometric mean speedups of 2.1x and 1.6x, respectively, over the best TPU v4/5p torus variants.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.