pith. machine review for the scientific record.

arxiv: 2604.13600 · v2 · submitted 2026-04-15 · 💻 cs.DC · cs.NI

Recognition: unknown

SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment

Fumikazu Konishi, Hirofumi Tsuruta, Yuuki Tsubouchi

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3

classification 💻 cs.DC cs.NI
keywords HPC cluster · LLM training · workload characterization · open networking · Ethernet fabric · GPU utilization · TOP500 · SONiC

The pith

A GPU cluster built on fully open 800 GbE networking with SONiC ranks 49th on the TOP500 list, while its job traces document how LLM development work shifts from large-scale training to mid-scale refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAKURAONE as a 100-node system with eight H100 GPUs per node and a 2 PB Lustre file system, built around an 800 GbE leaf-spine fabric using RoCEv2 and the SONiC open network operating system. It reports benchmark results that place the machine 49th on the ISC 2025 TOP500 list by HPL performance, noting that it is the only top-100 entry with a completely vendor-neutral networking stack. The authors then analyze job traces from its exclusive use by one LLM research project, finding that small jobs dominate the count while a few large jobs consume most GPU hours and that the mix shifts toward mid-scale jobs as the project moves from initial training to iterative work. A reader would care because these results test whether open Ethernet technology can meet the demands of current AI workloads and because the single-tenant setting supplies concrete data on how real development pipelines actually use the hardware.

Core claim

SAKURAONE achieves 33.95 PFLOP/s on HPL, 396.295 TFLOP/s on HPCG, and 339.86 PFLOP/s on HPL-MxP with FP8 while using only open 800 GbE components and SONiC. In the single-tenant LLM environment, the number of jobs is dominated by small-scale submissions, yet a small number of large-scale jobs account for the bulk of GPU resource consumption. Over the course of the project the workload distribution shifts from predominantly large training runs toward more numerous mid-scale jobs associated with refinement and iteration.
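
A quick back-of-envelope breakdown of these figures, assuming the 100-node, eight-GPU-per-node (800 H100) configuration stated in the paper; the per-GPU values below are illustrative arithmetic, not numbers reported by the authors.

```python
# Back-of-envelope breakdown of the reported benchmark figures.
# Assumes the 100-node x 8-GPU (800 H100) configuration stated in the paper;
# the per-GPU values are illustrative only, not reported by the authors.
NUM_GPUS = 100 * 8  # 800 H100 GPUs

hpl_rmax_pflops = 33.95   # HPL Rmax (FP64)
hpcg_tflops = 396.295     # HPCG (FP64, memory-bound)
hpl_mxp_pflops = 339.86   # HPL-MxP with FP8

print(f"HPL per GPU:      {hpl_rmax_pflops * 1e3 / NUM_GPUS:.1f} TFLOP/s")
print(f"HPCG per GPU:     {hpcg_tflops / NUM_GPUS:.2f} TFLOP/s")
print(f"HPL-MxP per GPU:  {hpl_mxp_pflops * 1e3 / NUM_GPUS:.1f} TFLOP/s")
print(f"MxP-to-HPL ratio: {hpl_mxp_pflops / hpl_rmax_pflops:.1f}x")
```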

What carries the argument

The rail-optimized 800 GbE leaf-spine fabric with RoCEv2 and the SONiC open network operating system, which supplies the interconnect for the 100 nodes and enables the reported scaling without proprietary fabrics.
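
To make the full-bisection claim concrete, a minimal bandwidth sketch follows; the 800 GbE link speed and full-bisection, rail-optimized design are stated in the paper, but the one-NIC-per-GPU (eight rails per node) layout assumed below is an illustration, not a figure taken from the paper.

```python
# Illustrative bandwidth arithmetic for a rail-optimized leaf-spine fabric.
# The 800 GbE link speed and full-bisection design are stated in the paper;
# the eight-rails-per-node (one NIC per GPU) layout is an assumption made
# only for this sketch.
NODES = 100
RAILS_PER_NODE = 8   # assumed: one 800 GbE NIC per H100 GPU
LINK_GBPS = 800

injection_per_node = RAILS_PER_NODE * LINK_GBPS   # Gb/s injected per node
total_injection = NODES * injection_per_node      # Gb/s cluster-wide
bisection = total_injection / 2                   # full bisection: half the nodes
                                                  # can saturate links to the other half

print(f"Per-node injection:  {injection_per_node / 1e3:.1f} Tb/s")
print(f"Cluster injection:   {total_injection / 1e3:.0f} Tb/s")
print(f"Bisection bandwidth: {bisection / 1e3:.0f} Tb/s")
```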

Load-bearing premise

That the benchmark numbers and the job traces collected during this single project's exclusive use accurately represent sustained production behavior and will recur in other LLM development settings.

What would settle it

Public logs from another single-tenant LLM cluster showing either no dominance of large jobs in GPU-time or no shift toward mid-scale jobs as the project matures, or the appearance of a second top-100 system that also uses a fully open Ethernet stack.

Figures

Figures reproduced from arXiv: 2604.13600 by Fumikazu Konishi, Hirofumi Tsuruta, Yuuki Tsubouchi.

Figure 1
Figure 1. System overview: 100 compute nodes, each with eight NVIDIA H100 GPUs (800 GPUs total); a 2 PB all-flash Lustre storage subsystem for high-throughput, low-latency data access; a full-bisection-bandwidth interconnect in a rail-optimized topology over RoCEv2 for fast multi-node communication; and secure, high-speed VPN access to interactive front-end nodes for efficient remote use.
Figure 2
Figure 2. SAKURAONE system detail.
Figure 3
Figure 3. Distribution of job states by (a) job count and (b) GPU-occupied time.
Figure 4
Figure 4. Distribution of jobs by node count: fraction of total job count (blue) and GPU-occupied time (orange) for each job size category.
Figure 5
Figure 5. Per-job GPU utilization by job size (nodes). (a) Distribution of average GPU utilization. (b) Distribution of the proportion of GPU-occupied time spent in low-utilization states (GPU utilization below 20%).
Figure 6
Figure 6. Cumulative distribution of job runtimes by node count.
Figure 7
Figure 7. Daily job submissions by node count.
read the original abstract

SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In ISC 2025 TOP500, SAKURAONE is ranked 49th by HPL and is the only top 100 system that uses a fully open networking stack - 800 GbE with SONiC - demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s (HPL Rmax), 396.295 TFLOP/s (HPCG), and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs and a 2 PB all-flash Lustre file system, interconnected via a rail-optimized 800 GbE leaf-spine fabric with RoCEv2. Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper describes SAKURAONE, a 100-node GPU cluster (8x NVIDIA H100 per node) with a 2 PB Lustre filesystem, built on the KOKARYOKU PHY platform and interconnected by a rail-optimized 800 GbE leaf-spine fabric using RoCEv2 and the open SONiC stack. It reports HPL Rmax of 33.95 PFLOP/s (rank 49 on ISC 2025 TOP500), HPCG of 396.295 TFLOP/s, and HPL-MxP FP8 of 339.86 PFLOP/s, asserts uniqueness among top-100 systems for a fully open 800 GbE networking stack, and presents empirical workload statistics from exclusive single-tenant use by an LLM development project, noting that small jobs dominate in count while a few large jobs dominate GPU-hours and that usage shifted from large- to mid-scale jobs over the project lifetime.

Significance. If the benchmark numbers and uniqueness claim hold, the work demonstrates the practical scalability of vendor-neutral Ethernet fabrics for AI-scale HPC at TOP500 levels and supplies concrete, production-derived statistics on LLM workload evolution that can inform scheduler design and capacity planning. The clear reporting of standard benchmarks (HPL, HPCG, HPL-MxP) and the observational nature of the workload data constitute verifiable contributions.

major comments (1)
  1. [Workload Dynamics / observed job statistics] The workload-dynamics section relies on job-log analysis but provides no explicit description of data collection, filtering criteria, job definition, handling of failed or queued jobs, or the exact observation window; these omissions limit reproducibility and make it difficult to assess whether the reported shift from large- to mid-scale jobs is robust or sensitive to processing choices.
minor comments (2)
  1. [Introduction / benchmark results] The claim that SAKURAONE is 'the only top 100 system that uses a fully open networking stack' should be accompanied by a brief footnote or reference to the TOP500 methodology or survey used to establish uniqueness.
  2. [Figures] Figure captions and axis labels for any workload histograms or time-series plots should explicitly state the binning method and the total number of jobs or GPU-hours represented.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The single major comment identifies a clear opportunity to strengthen the reproducibility of the workload analysis, which we will address directly in the revised manuscript.

read point-by-point responses
  1. Referee: The workload-dynamics section relies on job-log analysis but provides no explicit description of data collection, filtering criteria, job definition, handling of failed or queued jobs, or the exact observation window; these omissions limit reproducibility and make it difficult to assess whether the reported shift from large- to mid-scale jobs is robust or sensitive to processing choices.

    Authors: We agree that the current description is insufficient for full reproducibility. In the revised version we will insert a new subsection (Section 4.1) that explicitly states: (1) data were extracted from the Slurm accounting database via sacct queries over the period 2024-03-01 to 2024-12-15; (2) a job is defined as any allocation with at least one GPU hour; (3) filtering removed only system-reserved maintenance jobs and jobs with zero GPU time; (4) failed and queued jobs were logged separately but excluded from the utilization histograms and GPU-hour totals; and (5) the large-to-mid-scale transition remains statistically significant (Kolmogorov-Smirnov p < 0.01) under alternative binning thresholds of 32, 64, and 128 GPUs. We will also release the anonymized job-log summary tables as supplementary material. revision: yes
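
As a rough illustration of the check described in the response above, the sketch below splits a Slurm accounting export into an early and a late phase and applies a two-sample Kolmogorov-Smirnov test to the node-count distributions; the file name, column names, and split date are assumptions for this example, not the authors' released pipeline.

```python
# Hypothetical version of the large-to-mid-scale shift check described in the
# rebuttal. Assumes a CSV export of Slurm accounting data (e.g. produced from
# `sacct --parsable2`) with submit_date, num_nodes, gpu_hours, and
# is_maintenance columns; schema and file name are illustrative only.
import pandas as pd
from scipy.stats import ks_2samp

jobs = pd.read_csv("job_log_summary.csv", parse_dates=["submit_date"])

# Mirror the filtering the rebuttal describes: drop maintenance jobs and
# jobs that consumed no GPU time.
jobs = jobs[(jobs["gpu_hours"] > 0) & (~jobs["is_maintenance"])]

# Split the observation window into an early (pretraining-heavy) and a late
# (refinement-heavy) phase at an arbitrary, illustrative midpoint.
early = jobs.loc[jobs["submit_date"] < "2025-02-15", "num_nodes"]
late = jobs.loc[jobs["submit_date"] >= "2025-02-15", "num_nodes"]

# Two-sample KS test on the per-phase node-count distributions.
stat, p_value = ks_2samp(early, late)
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")
```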

Circularity Check

0 steps flagged

No significant circularity; paper is purely descriptive

full rationale

The manuscript contains no mathematical derivations, equations, fitted parameters, predictions, or models. All claims rest on external verifiable benchmarks (TOP500 HPL ranking and Rmax value), direct hardware specifications, and empirical job-log statistics from a single-tenant deployment. No self-citations are load-bearing, no ansatz is smuggled, and no result is renamed or redefined in terms of itself. The derivation chain is therefore empty and self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The report relies on standard benchmark definitions (HPL, HPCG) and direct system measurements without introducing new parameters, axioms, or entities.

pith-pipeline@v0.9.0 · 5587 in / 1132 out tokens · 44027 ms · 2026-05-10T12:46:09.580954+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Challenges in computing resource sharing towards next-gen interactive accelerated HPC

    Endo, T., Minami, S., Nomura, A., Ohtsuji, H., Kato, J., Miwa, M., Yoshida, E., Yuki, T., and Sakamoto, R. Challenges in computing resource sharing towards next-gen interactive accelerated HPC. In High Performance Computing. ISC High Performance 2024 International Workshops: Hamburg, Germany, May 12–16, 2024, Revised Selected Papers, pp. 231–242.

  2. [2]

    RDMA over Ethernet for distributed training at Meta scale

    Gangidi, A., Miao, R., Zheng, S., Bondu, S. J., Goes, G., Morsy, H., Puri, R., Riftadi, M., Shetty, A. J., Yang, J., Zhang, S., Fernandez, M. J., Gandham, S., and Zeng, H. RDMA over Ethernet for distributed training at Meta scale. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 57–70.

  3. [3]

    An empirical study on low GPU utilization of deep learning jobs

    Gao, Y., He, Y., Li, X., Zhao, B., Lin, H., Liang, Y., Zhong, J., Zhang, H., Wang, J., Zeng, Y., Gui, K., Tong, J., and Yang, M. An empirical study on low GPU utilization of deep learning jobs. In IEEE/ACM 46th International Conference on Software Engineering (ICSE 2024), pp. 1–13.

  4. [4]

    RDMA over commodity Ethernet at scale

    Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J., and Lipshteyn, M. RDMA over commodity Ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference, pp. 202–215.

  5. [5]

    Datacenter Ethernet and RDMA: Issues at hyperscale

    Hoefler, T., Roweth, D., Underwood, K., Alverson, B., Griswold, M., Tabatabaee, V., Kalkunte, M., Anubolu, S., Shen, S., Kabbani, A., McLaren, M., and Scott, S. Datacenter Ethernet and RDMA: Issues at hyperscale. arXiv preprint arXiv:2302.03337.

  6. [6]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Le Scao, T., Fan, A., et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

  7. [7]

    MLPerf training benchmark

    Mattson, P., Cheng, C., Diamos, G., Coleman, C., Micikevicius, P., Patterson, D., Tang, H., Wei, G.-Y., Bailis, P., Bittorf, V., Brooks, D., Chen, D., Dutta, D., Gupta, U., Hazelwood, K., Hock, A., Huang, X., Kang, D., Kanter, D., Kumar, N., Liao, J., Narayanan, D., Oguntebi, T., Pekhimenko, G., Pentecost, L., Reddi, V. J., Robie, T., St John, T., Wu, C.-J., Xu, L., Young, C., and Zaharia, M. MLPerf training benchmark. In Proceedings of Machine Learning and Systems (MLSys), 2020.

  8. [8]

    ABCI 3.0: Evolution of the leading AI infrastructure in Japan

    Takano, R., Takizawa, S., Tanimura, Y., Nakada, H., and Ogawa, H. ABCI 3.0: Evolution of the leading AI infrastructure in Japan. arXiv preprint arXiv:2411.09134.

  9. [9]

    SONiC: Software for open networking in the cloud

    Yuan, L. SONiC: Software for open networking in the cloud. Slide deck, APNet 2018 (2nd Asia-Pacific Workshop on Networking).

  10. [10]

    Congestion control for large-scale RDMA deployments

    Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., Liron, Y., Padhye, J., Raindel, S., Haj Yahia, M., and Zhang, M. Congestion control for large-scale RDMA deployments. In ACM SIGCOMM Computer Communication Review, volume 45, pp. 523–536, 2015.