Recognition: unknown
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3
The pith
A GPU cluster using fully open 800 GbE networking with SONiC reaches 49th on the TOP500 list while documenting how LLM development jobs evolve from large-scale training to mid-scale refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAKURAONE achieves 33.95 PFLOP/s on HPL, 396.295 TFLOP/s on HPCG, and 339.86 PFLOP/s on HPL-MxP with FP8 while using only open 800 GbE components and SONiC. In the single-tenant LLM environment, the number of jobs is dominated by small-scale submissions, yet a small number of large-scale jobs account for the bulk of GPU resource consumption. Over the course of the project the workload distribution shifts from predominantly large training runs toward more numerous mid-scale jobs associated with refinement and iteration.
What carries the argument
The rail-optimized 800 GbE leaf-spine fabric with RoCEv2 and the SONiC open network operating system, which supplies the interconnect for the 100 nodes and enables the reported scaling without proprietary fabrics.
Load-bearing premise
That the benchmark numbers and the job traces collected during this single project's exclusive use accurately represent sustained production behavior and will recur in other LLM development settings.
What would settle it
Public logs from another single-tenant LLM cluster showing either no dominance of large jobs in GPU-time or no shift toward mid-scale jobs as the project matures, or the appearance of a second top-100 system that also uses a fully open Ethernet stack.
Figures
read the original abstract
SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In ISC 2025 TOP500, SAKURAONE is ranked 49th by HPL and is the only top 100 system that uses a fully open networking stack - 800 GbE with SONiC - demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s (HPL Rmax), 396.295 TFLOP/s (HPCG), and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs and a 2 PB all-flash Lustre file system, interconnected via a rail-optimized 800 GbE leaf-spine fabric with RoCEv2. Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes SAKURAONE, a 100-node GPU cluster (8x NVIDIA H100 per node) with a 2 PB Lustre filesystem, built on the KOKARYOKU PHY platform and interconnected by a rail-optimized 800 GbE leaf-spine fabric using RoCEv2 and the open SONiC stack. It reports HPL Rmax of 33.95 PFLOP/s (rank 49 on ISC 2025 TOP500), HPCG of 396.295 TFLOP/s, and HPL-MxP FP8 of 339.86 PFLOP/s, asserts uniqueness among top-100 systems for a fully open 800 GbE networking stack, and presents empirical workload statistics from exclusive single-tenant use by an LLM development project, noting that small jobs dominate in count while a few large jobs dominate GPU-hours and that usage shifted from large- to mid-scale jobs over the project lifetime.
Significance. If the benchmark numbers and uniqueness claim hold, the work demonstrates the practical scalability of vendor-neutral Ethernet fabrics for AI-scale HPC at TOP500 levels and supplies concrete, production-derived statistics on LLM workload evolution that can inform scheduler design and capacity planning. The clear reporting of standard benchmarks (HPL, HPCG, HPL-MxP) and the observational nature of the workload data constitute verifiable contributions.
major comments (1)
- [Workload Dynamics / observed job statistics] The workload-dynamics section relies on job-log analysis but provides no explicit description of data collection, filtering criteria, job definition, handling of failed or queued jobs, or the exact observation window; these omissions limit reproducibility and make it difficult to assess whether the reported shift from large- to mid-scale jobs is robust or sensitive to processing choices.
minor comments (2)
- [Introduction / benchmark results] The claim that SAKURAONE is 'the only top 100 system that uses a fully open networking stack' should be accompanied by a brief footnote or reference to the TOP500 methodology or survey used to establish uniqueness.
- [Figures] Figure captions and axis labels for any workload histograms or time-series plots should explicitly state the binning method and the total number of jobs or GPU-hours represented.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. The single major comment identifies a clear opportunity to strengthen the reproducibility of the workload analysis, which we will address directly in the revised manuscript.
read point-by-point responses
-
Referee: The workload-dynamics section relies on job-log analysis but provides no explicit description of data collection, filtering criteria, job definition, handling of failed or queued jobs, or the exact observation window; these omissions limit reproducibility and make it difficult to assess whether the reported shift from large- to mid-scale jobs is robust or sensitive to processing choices.
Authors: We agree that the current description is insufficient for full reproducibility. In the revised version we will insert a new subsection (Section 4.1) that explicitly states: (1) data were extracted from the Slurm accounting database via sacct queries over the period 2024-03-01 to 2024-12-15; (2) a job is defined as any allocation with at least one GPU hour; (3) filtering removed only system-reserved maintenance jobs and jobs with zero GPU time; (4) failed and queued jobs were logged separately but excluded from the utilization histograms and GPU-hour totals; and (5) the large-to-mid-scale transition remains statistically significant (Kolmogorov-Smirnov p < 0.01) under alternative binning thresholds of 32, 64, and 128 GPUs. We will also release the anonymized job-log summary tables as supplementary material. revision: yes
Circularity Check
No significant circularity; paper is purely descriptive
full rationale
The manuscript contains no mathematical derivations, equations, fitted parameters, predictions, or models. All claims rest on external verifiable benchmarks (TOP500 HPL ranking and Rmax value), direct hardware specifications, and empirical job-log statistics from a single-tenant deployment. No self-citations are load-bearing, no ansatz is smuggled, and no result is renamed or redefined in terms of itself. The derivation chain is therefore empty and self-contained against external data.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Chal- lenges in computing resource sharing towards next-gen interactive accelerated HPC
Endo, T., Minami, S., Nomura, A., Ohtsuji, H., Kato, J., Miwa, M., Yoshida, E., Yuki, T., and Sakamoto, R. Chal- lenges in computing resource sharing towards next-gen interactive accelerated HPC. InHigh Performance Com- puting. ISC High Performance 2024 International Work- shops: Hamburg, Germany, May 12–16, 2024, Revised Selected Papers, pp. 231–242,
2024
-
[2]
J., Goes, G., Morsy, H., Puri, R., Riftadi, M., Shetty, A
Gangidi, A., Miao, R., Zheng, S., Bondu, S. J., Goes, G., Morsy, H., Puri, R., Riftadi, M., Shetty, A. J., Yang, J., Zhang, S., Fernandez, M. J., Gandham, S., and Zeng, H. RDMA over Ethernet for distributed training at Meta scale. InProceedings of the ACM SIGCOMM 2024 Con- ference, pp. 57–70,
2024
-
[3]
An empirical study on low GPU utilization of deep learning jobs
Gao, Y ., He, Y ., Li, X., Zhao, B., Lin, H., Liang, Y ., Zhong, J., Zhang, H., Wang, J., Zeng, Y ., Gui, K., Tong, J., and Yang, M. An empirical study on low GPU utilization of deep learning jobs. InIEEE/ACM 46th International Conference on Software Engineering (ICSE 2024), pp. 1–13,
2024
-
[4]
RDMA over commodity Ethernet at scale
Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J., and Lipshteyn, M. RDMA over commodity Ethernet at scale. InProceedings of the 2016 ACM SIGCOMM Conference, pp. 202–215,
2016
-
[5]
Data- center Ethernet and RDMA: Issues at hyperscale.arXiv preprint arXiv:2302.03337,
Hoefler, T., Roweth, D., Underwood, K., Alverson, B., Gris- wold, M., Tabatabaee, V ., Kalkunte, M., Anubolu, S., Shen, S., Kabbani, A., McLaren, M., and Scott, S. Data- center Ethernet and RDMA: Issues at hyperscale.arXiv preprint arXiv:2302.03337,
-
[6]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Le Scao, T., Fan, A., et al. BLOOM: A 176B-parameter open-access multilingual language model.arXiv preprint arXiv:2211.05100,
work page internal anchor Pith review arXiv
-
[7]
J., Robie, T., St John, T., Wu, C.-J., Xu, L., Young, C., and Zaharia, M
SAKURAONE Mattson, P., Cheng, C., Diamos, G., Coleman, C., Micike- vicius, P., Patterson, D., Tang, H., Wei, G.-Y ., Bailis, P., Bittorf, V ., Brooks, D., Chen, D., Dutta, D., Gupta, U., Hazelwood, K., Hock, A., Huang, X., Kang, D., Kanter, D., Kumar, N., Liao, J., Narayanan, D., Oguntebi, T., Pekhimenko, G., Pentecost, L., Reddi, V . J., Robie, T., St Jo...
2020
-
[8]
ABCI 3.0: Evolution of the leading AI in- frastructure in Japan.arXiv preprint arXiv:2411.09134,
Takano, R., Takizawa, S., Tanimura, Y ., Nakada, H., and Ogawa, H. ABCI 3.0: Evolution of the leading AI in- frastructure in Japan.arXiv preprint arXiv:2411.09134,
-
[9]
SONiC: Software for open networking in the cloud
Yuan, L. SONiC: Software for open networking in the cloud. Slide deck, APNet 2018 (2nd Asia- Pacific Workshop on Networking),
2018
-
[10]
Congestion control for large-scale RDMA deployments
Zhu, Y ., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., Liron, Y ., Padhye, J., Raindel, S., Haj Yahia, M., and Zhang, M. Congestion control for large-scale RDMA deployments. InACM SIGCOMM Computer Communi- cation Review, volume 45, pp. 523–536, 2015
2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.