pith. machine review for the scientific record.

arxiv: 2605.07726 · v1 · submitted 2026-05-08 · 💻 cs.DC

Recognition: no theorem link

A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:17 UTC · model grok-4.3

classification 💻 cs.DC
keywords large language models · distributed training · high performance computing · tensor parallelism · pipeline parallelism · data parallelism · scaling efficiency · GPU training

The pith

Standard tensor, pipeline, and data parallelism trains a 175B language model at 10 percent of per-tile bf16 peak on SuperMUC-NG Phase 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to combine tensor parallelism, pipeline parallelism, and sharded data parallelism to train language models with up to 175 billion parameters on the SuperMUC-NG Phase 2 supercomputer. This matters to a sympathetic reader because training such models demands enormous resources, and demonstrating that off-the-shelf software reaches usable efficiency on real hardware opens access to more teams. The authors tune the mix of these techniques and measure their effects on throughput and scaling across different model sizes. If the results hold, users can run large-scale training on this system by following the described recipe with unmodified public tools.

Core claim

By integrating tensor parallelism, pipeline parallelism, and sharded data parallelism with hyperparameter tuning on SuperMUC-NG Phase 2, the authors reach 10 percent of theoretical peak per-tile bf16 FLOPs for a 175B model using an unmodified publicly available software stack, together with 93 percent weak scaling efficiency and 82 percent strong scaling efficiency on 128 nodes.
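
Read as a model-FLOPs-utilization figure, the 10 percent number follows from the usual ~6N FLOPs-per-trained-token estimate for dense transformers. A minimal sketch of that arithmetic, assuming placeholder values for the per-tile bf16 peak and the per-tile token throughput (neither is taken from the paper):

```python
# Sketch of the model-FLOPs-utilization (MFU) arithmetic behind a "percent of
# peak" claim. PEAK_BF16_TFLOPS_PER_TILE and TOKENS_PER_SEC_PER_TILE are
# illustrative placeholders, not measurements or specifications from the paper.

PARAMS = 175e9                      # 175B-parameter model
PEAK_BF16_TFLOPS_PER_TILE = 400.0   # assumed per-tile bf16 peak (placeholder)
TOKENS_PER_SEC_PER_TILE = 38.0      # assumed per-tile training throughput (placeholder)

def mfu(params: float, tokens_per_sec_per_tile: float, peak_tflops: float) -> float:
    """Achieved fraction of peak, using ~6*N FLOPs per trained token."""
    achieved_tflops = 6 * params * tokens_per_sec_per_tile / 1e12
    return achieved_tflops / peak_tflops

print(f"MFU ≈ {mfu(PARAMS, TOKENS_PER_SEC_PER_TILE, PEAK_BF16_TFLOPS_PER_TILE):.1%}")
# -> MFU ≈ 10.0% with these placeholder numbers
```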

What carries the argument

The combined strategy of tensor parallelism, pipeline parallelism, and sharded data parallelism, which distributes model layers and data shards across GPU tiles while keeping communication costs in check.
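
As a concrete illustration, a minimal sketch of the constraint any such strategy must satisfy: the tensor-, pipeline-, and data-parallel degrees multiply out to the total number of GPU tiles, with tensor parallelism kept inside one node (the paper's checklist recommends TP ≤ 8 on this system). The node counts, degrees, and batch sizes below are hypothetical, not the authors' tuned configuration.

```python
# Sketch of a TP x PP x DP layout check; all concrete numbers are hypothetical,
# not the tuned 175B configuration (which the abstract does not list).

def check_layout(nodes, tiles_per_node, tp, pp, dp, micro_batch, global_batch):
    total_tiles = nodes * tiles_per_node
    assert tp * pp * dp == total_tiles, "parallel degrees must cover every tile exactly"
    assert tp <= tiles_per_node, "keep tensor parallelism within a single node"
    # the global batch must split evenly across data-parallel replicas and micro-batches
    assert global_batch % (dp * micro_batch) == 0, "global batch not divisible"
    return global_batch // (dp * micro_batch)   # micro-batches per optimizer step

# hypothetical 128-node run with 8 tiles per node (1024 tiles)
micro_batches = check_layout(nodes=128, tiles_per_node=8, tp=8, pp=8, dp=16,
                             micro_batch=1, global_batch=512)
print(f"{micro_batches} micro-batches per optimizer step keep the pipeline busy")
```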

If this is right

  • Training of models up to 175 billion parameters becomes practical on this hardware with standard tools.
  • High weak scaling efficiency supports efficient expansion to additional nodes for larger workloads.
  • The same tuned combination delivers usable performance across a range of model sizes.
  • Strong scaling to 128 nodes keeps training times manageable on the available accelerators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same balance of parallelism techniques could serve as a starting point on other systems that use similar GPU accelerators.
  • Further tests on models beyond 175B parameters would reveal where additional tuning becomes necessary.
  • Wider adoption of the recipe could increase the number of groups able to experiment with foundational models on public HPC resources.

Load-bearing premise

The measured throughput and scaling efficiencies obtained after tuning will be achieved by other users on the same system with the same standard software distributions and no extra engineering.

What would settle it

An independent run of the 175B model on SuperMUC-NG Phase 2 using the described parallelism settings and unmodified software that falls well below 10 percent of peak throughput or shows scaling efficiencies under 80 percent would undermine the accessibility claim.
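
A sketch of how such a replication could be scored, assuming efficiency is computed as the measured speedup over a baseline divided by the ideal linear speedup; the baseline node count and throughput values here are invented placeholders, not the paper's data.

```python
# Scoring a replication attempt against the 80% line mentioned above. The
# throughput values are placeholders; only the 128-node scale and the 80%
# threshold come from the text.

def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Measured speedup over the baseline divided by the ideal (linear) speedup."""
    return (throughput / base_throughput) / (nodes / base_nodes)

baseline = (16, 1.0)    # (nodes, normalized throughput) -- hypothetical baseline
large = (128, 7.44)     # hypothetical 128-node measurement
eff = scaling_efficiency(*baseline, *large)
verdict = "consistent with" if eff >= 0.80 else "well below"
print(f"efficiency ≈ {eff:.0%}, {verdict} the reported range")
```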

Figures

Figures reproduced from arXiv: 2605.07726 by Ajay Navilarekal Rajgopal, Nikolai Solmsdorf.

Figure 1. Throughput vs. tensor-parallel (TP) degree for the 3.6B model.
Figure 2. Throughput and gain vs. number of micro-batches.
Figure 3. Throughput vs. number of pipeline stages.
Figure 4. Search trajectory.
Figure 5. Weak and strong scaling efficiencies.
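
The pipeline-parallel sweeps behind Figures 2 and 3 revolve around the pipeline bubble, which the paper describes as scaling with the ratio of pipeline stages to micro-batches per optimizer step. A minimal sketch using the textbook bubble-fraction estimate (PP − 1)/(M + PP − 1) for synchronous pipeline schedules, which behaves like PP/M once M is large; this illustrates the effect and is not necessarily the exact model used in the paper.

```python
# Textbook bubble-fraction estimate for a synchronous pipeline schedule: the
# pipeline sits partially idle for (PP - 1) of the (M + PP - 1) micro-batch
# slots in each optimizer step, so the overhead shrinks as M grows.

def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    return (pp_stages - 1) / (micro_batches + pp_stages - 1)

for m in (4, 8, 16, 32, 64):
    print(f"PP=8, M={m:3d}: bubble ≈ {bubble_fraction(8, m):.1%}")
```
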
read the original abstract

Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources, e.g. a one-trillion-parameter GPT-style model requires an estimated 120 million exaflops. This challenge necessitates efficient distributed training strategies on cutting-edge High-Performance Computing (HPC) infrastructure. In this work, we explore the SuperMUC-NG Phase 2 (SMNG-P2) system at the Leibniz Supercomputing Centre (LRZ) in Garching, Germany, equipped with Intel Data Center GPU Max 1550 accelerators to extract the necessary computational power. We enable and investigate a comprehensive recipe of parallel training techniques, including tensor parallelism, pipeline parallelism, and sharded data parallelism, essential for facilitating the training of LLMs up to 175 billion-parameter scale on SMNG-P2. Through empirical assessment and extensive hyperparameter tuning, we analyze the complex interplay among these techniques and determine their impact on GPU computational efficiency. We identify an optimized combined strategy that yields high throughput and enables the efficient training of LLMs of varying sizes. Specifically, for the 175B model, we achieved per-tile throughput of 10% of theoretical peak per-tile bf16 FLOPs, employing an out-of-the-box publicly available software stack, utilizing standard distributions without further modification. This approach ensures broad accessibility, as our methodology can be replicated by any user on SMNG-P2 system without need for porting or specialized software engineering. Furthermore, we achieved 93% weak scaling efficiency and strong scaling efficiency of 82% on 128 nodes. This scalable recipe provides a crucial blueprint for efficiently utilizing advanced exascale systems for next-generation foundational model development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a recipe for distributed training of LLMs up to 175B parameters on SuperMUC-NG Phase 2 (Intel Data Center GPU Max 1550) by combining tensor, pipeline, and sharded data parallelism. Through empirical tuning it reports 10% of theoretical per-tile BF16 peak FLOPs throughput for the 175B model, 93% weak-scaling efficiency, and 82% strong-scaling efficiency on 128 nodes, all achieved with an unmodified, publicly available software stack.

Significance. If the concrete configuration parameters are supplied, the work provides a practical, hardware-specific blueprint that lowers the barrier for other users to train large models on this exascale-class system. The reported scaling numbers on real hardware constitute useful reference data for the distributed-training community.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (Results): the central reproducibility claim—that the 10% peak throughput and scaling efficiencies can be matched by any user “without need for porting or specialized software engineering”—is load-bearing yet unsupported. The text states that the numbers were obtained after “extensive hyperparameter tuning” but does not list the final tensor-parallelism, pipeline-parallelism, data-parallelism degrees, micro-batch size, activation-checkpointing policy, or optimizer settings used for the 175B run. Without these values the recipe reduces to “repeat the same search,” contradicting the no-additional-engineering guarantee.
  2. [§3 / §4] §3 (Methodology) and §4: no raw throughput measurements, error bars, or exclusion criteria for the scaling experiments are provided. The 93% weak-scaling and 82% strong-scaling figures on 128 nodes cannot be independently verified or compared with other systems without the underlying per-node or per-tile numbers and the exact node counts at each scale point.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the exact model architecture (hidden size, layers, attention heads) and the precise BF16 FLOP counting convention used to compute the 10% peak figure.
  2. [Figures] Figure captions and axis labels in the scaling plots should explicitly state the parallelism configuration and batch size for each curve.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of reproducibility and verifiability. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (Results): the central reproducibility claim—that the 10% peak throughput and scaling efficiencies can be matched by any user “without need for porting or specialized software engineering”—is load-bearing yet unsupported. The text states that the numbers were obtained after “extensive hyperparameter tuning” but does not list the final tensor-parallelism, pipeline-parallelism, data-parallelism degrees, micro-batch size, activation-checkpointing policy, or optimizer settings used for the 175B run. Without these values the recipe reduces to “repeat the same search,” contradicting the no-additional-engineering guarantee.

    Authors: We agree that the absence of the final configuration parameters limits the immediate usability of the recipe and weakens the reproducibility claim. While the manuscript focuses on the combination of tensor, pipeline, and sharded data parallelism with an unmodified public software stack, providing the specific values for the 175B model is necessary to avoid requiring readers to repeat the tuning process. In the revised version we will add a table in §4 that explicitly lists the tensor-parallelism degree, pipeline-parallelism degree, data-parallelism degree, micro-batch size, activation-checkpointing policy, and optimizer settings used for the reported 175B run. This addition will directly support the claim that the approach can be replicated without specialized engineering. revision: yes

  2. Referee: [§3 / §4] §3 (Methodology) and §4: no raw throughput measurements, error bars, or exclusion criteria for the scaling experiments are provided. The 93% weak-scaling and 82% strong-scaling figures on 128 nodes cannot be independently verified or compared with other systems without the underlying per-node or per-tile numbers and the exact node counts at each scale point.

    Authors: We acknowledge that the scaling efficiencies cannot be fully verified or compared without the supporting raw data. The reported 93% weak-scaling and 82% strong-scaling figures on 128 nodes were derived from measured throughputs, yet the original submission did not include per-scale raw values, error bars, or criteria for excluding runs (e.g., those affected by transient system variability). In the revised manuscript we will expand §4 to include a table of per-node and per-tile throughput measurements at each scale point for both weak and strong scaling, together with standard deviations and notes on any exclusion criteria applied during data collection. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivation chain

full rationale

The paper reports direct hardware measurements of per-tile throughput (10% of theoretical bf16 peak) and scaling efficiencies (93% weak, 82% strong on 128 nodes) for a 175B model after hyperparameter tuning on SuperMUC-NG Phase 2. These are presented as experimental outcomes using an out-of-the-box software stack, not as outputs of any mathematical derivation, fitted model, or self-referential equation. No equations appear in the provided text, no parameters are fitted and then relabeled as predictions, and no self-citations are invoked to justify uniqueness or load-bearing premises. The claims are externally falsifiable by replication on the same system, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that standard parallelism libraries function correctly on the target Intel GPU hardware without modification and that the chosen hyperparameter settings generalize beyond the tested runs.

free parameters (1)
  • parallelism degree combination
    Extensive hyperparameter tuning is mentioned as the route to the optimized strategy; the specific values are not listed in the abstract.
axioms (1)
  • domain assumption: the out-of-the-box public software stack requires no porting or modification on SMNG-P2
    Stated as enabling broad accessibility for any user.

pith-pipeline@v0.9.0 · 5629 in / 1260 out tokens · 32412 ms · 2026-05-11T02:17:46.151866+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

  1. Balaprakash, P., Chard, R., Hall, D.M., Wild, S.M., Foster, I., Bouhlel, M.A., Ozik, J., Viquez, E., Choudhary, S., Archibald, R., et al.: DeepHyper. Journal of Open Source Software 10(110), 7975 (2025). https://doi.org/10.21105/joss.07975
  2. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020)
  3. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., et al.: PaLM: Scaling language modeling with Pathways
  4. Cielo, S., Pöppl, A., Pribec, I.: SYCL for energy-efficient numerical astrophysics: the case of dpEcho. arXiv preprint arXiv:2508.14117 (2025)
  5. Dash, S., Lyngaas, S., Varma, N., Treichler, S., Liesen, S.: Optimizing distributed training on Frontier for large language models. arXiv preprint arXiv:2312.12705 (2023)
  6. Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Dandekar, S., Ganger, G.R., Gibbons, P.B.: PipeDream: Fast and efficient pipeline parallel DNN training. Proceedings of the 27th ACM Symposium on Operating Systems Principles (2018)
  7. Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q.V., Wu, Y., Chen, Z.: GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019)
  8. Intel: Intel Data Center GPU Max 1550 product specifications (2023), https://www.intel.com/content/www/us/en/products/sku/232873/intel-data-center-gpu-max-1550.html, accessed 2026-03-01
  9. Intel: Intel Extension for PyTorch documentation (2024), https://intel.github.io/intel-extension-for-pytorch/, accessed 2026-03-01
  10. Intel: Intel Extension for PyTorch (GitHub repository) (2024), https://github.com/intel/intel-extension-for-pytorch, accessed 2026-03-01
  11. Leibniz Supercomputing Centre of the Bavarian Academy of Sciences (LRZ): New supercomputer that offers more methods (SuperMUC-NG Phase 2 in test) (Feb 2024), https://www.lrz.de/en/news/detail/2024-02-28-supermuc-ng-phase2-in-test-en, accessed 2026-03-01
  12. Leibniz Supercomputing Centre of the Bavarian Academy of Sciences (LRZ): Probezeit eines Supercomputers: SuperMUC-NG Phase 2 Pilotphase (May 2024), https://www.lrz.de/news/detail/2024-05-06-sng-2-pilotphase, accessed 2026-03-01
  13. Leibniz Supercomputing Centre of the Bavarian Academy of Sciences (LRZ): SuperMUC-NG Intro v3 (2024), https://doku.lrz.de/files/2061175136/2060255562/18/1763042826963/SuperMUC-NG-Intro-v3.pdf, accessed 2026-03-02
  14. Lockwood, G.K.: A closer look at "training" a trillion-parameter model on Frontier. Glenn K. Lockwood's Blog (2024), https://blog.glennklockwood.com/2024/01/a-closer-look-at-training-trillion.html
  15. Microsoft: Megatron-DeepSpeed (GitHub repository) (2024), https://github.com/microsoft/Megatron-DeepSpeed, accessed 2026-03-01
  16. Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., Zaharia, M.: Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv preprint arXiv:2104.04473 (2021)
  17. NVIDIA: Megatron-LM (GitHub repository) (2024), https://github.com/NVIDIA/Megatron-LM, accessed 2026-03-01
  18. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018), https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
  19. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019), https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  20. Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: ZeRO: Memory optimizations toward training trillion parameter models. arXiv preprint arXiv:1910.02054 (2020)
  21. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)
  22. Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., Zhang, E., Child, R., Aminabadi, R.Y., Bernauer, J., Song, X., Shoeybi, M., He, Y., Houston, M., Tiwary, S., Catanzaro, B.: Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint
  23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)