An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
Pith reviewed 2026-05-10 14:11 UTC · model grok-4.3
The pith
A public European supercomputer completes pre-training of a 70B open multilingual LLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a massive pre-training campaign for the 70B-parameter Apertus model succeeded on one of Europe's largest open-science supercomputers, and that targeted fixes to the storage systems and interconnect stability are what made training reliable at this scale. On this account, the effort transforms standard HPC infrastructure into a dependable platform for large language model development; the authors present it as a first-of-its-kind academic achievement and as the foundation for sustained, iterative operations, in particular fine-tuning of foundation models.
What carries the argument
Storage adaptations and interconnect stabilization that convert an HPC supercomputer into a resilient, software-defined machine learning platform.
Load-bearing premise
The described changes to storage and interconnects proved sufficient for completing stable training without major undisclosed performance shortfalls or failures.
What would settle it
An independent run or audit that reproduces the full pre-training campaign and measures actual uptime, throughput, and completion rates against the reported outcomes.
Figures
read the original abstract
Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine tuning foundation models, that extends well beyond a single model training run.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the engineering process of training Apertus, a 70B-parameter open multilingual LLM, on the Alps supercomputer using NVIDIA GH200 Grace Hopper Superchips. It details challenges in adapting HPC infrastructure—including storage bottlenecks and large-scale interconnect stabilization—along with lessons for transforming the system into a resilient software-defined ML platform, and outlines post-training requirements for iterative fine-tuning. The central claim is that this represents a first-of-its-kind successful academic deployment at the 70B scale for open science.
Significance. If the infrastructure adaptations enabled stable training as described, the work offers practical value by documenting real-world HPC-to-ML transitions on a major European public system, supporting sovereign AI efforts outside private-sector dominance. The narrative of challenges and platform evolution provides transferable lessons for similar large-scale training campaigns, though the lack of supporting performance data reduces its utility as a reproducible reference.
major comments (2)
- Abstract and main narrative on the pre-training campaign: The assertion of successful stable 70B-scale training is presented without any quantitative metrics (e.g., achieved tokens/sec, sustained TFLOPS utilization, restart/failure counts, or effective vs. theoretical compute delivered). This leaves the claims of 'resilient platform' and 'successful deployment' as qualitative descriptions whose effectiveness cannot be evaluated, directly undermining the load-bearing assertion of first-of-its-kind academic achievement.
- Sections detailing storage and interconnect adaptations: The manuscript states that bottlenecks were overcome and interconnects stabilized to enable training, but provides no before/after benchmarks, ablation results, or failure-mode analysis to demonstrate that these changes were sufficient and that no major undisclosed shortfalls occurred.
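The utilization figure the referee asks for is conventionally reported as model FLOPs utilization (MFU), computed from sustained token throughput via the standard ~6·N·D approximation for transformer training FLOPs per token. A minimal sketch of that calculation; every number below is hypothetical and illustrative, not taken from the paper:

```python
# Estimate model FLOPs utilization (MFU) from sustained token throughput,
# using the common ~6 * N FLOPs-per-token approximation for transformer
# training (forward + backward). All inputs below are illustrative only.

def mfu(params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Fraction of theoretical peak compute actually delivered."""
    achieved = 6.0 * params * tokens_per_sec   # sustained training FLOP/s
    peak = n_gpus * peak_flops_per_gpu         # aggregate theoretical FLOP/s
    return achieved / peak

# Hypothetical campaign: a 70B-parameter model on 4096 GPUs, each with
# ~1e15 FLOP/s peak, sustaining 1.2e6 tokens/s across the job.
u = mfu(params=70e9, tokens_per_sec=1.2e6, n_gpus=4096,
        peak_flops_per_gpu=1e15)
print(f"estimated MFU: {u:.1%}")
```

Reporting even one such number alongside wall-clock progress would let readers compare the run against other published large-scale campaigns.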
minor comments (1)
- The abstract and text would benefit from explicit cross-references to any accompanying figures or tables that might illustrate the platform architecture or timeline of adaptations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key opportunities to strengthen the evidential basis of our claims. We have revised the manuscript to incorporate additional operational details and clarifications while preserving the paper's focus as an engineering case study on HPC-to-ML platform adaptation.
read point-by-point responses
-
Referee: Abstract and main narrative on the pre-training campaign: The assertion of successful stable 70B-scale training is presented without any quantitative metrics (e.g., achieved tokens/sec, sustained TFLOPS utilization, restart/failure counts, or effective vs. theoretical compute delivered). This leaves the claims of 'resilient platform' and 'successful deployment' as qualitative descriptions whose effectiveness cannot be evaluated, directly undermining the load-bearing assertion of first-of-its-kind academic achievement.
Authors: We agree that quantitative metrics would allow readers to better evaluate the stability and scale of the training run. The manuscript prioritizes documenting the engineering challenges and platform evolution over performance benchmarking, as these aspects represent the primary contribution for academic and public-sector efforts. In the revised version, we have added a summary of key campaign statistics in Section 3, including approximate sustained throughput in tokens per second, total tokens processed, and the observed restart rate due to transient failures. We also include a high-level estimate of effective compute delivery relative to theoretical peak based on hardware specifications and wall-clock progress. These additions provide concrete anchors for the 'successful deployment' claim without converting the paper into a benchmarking study. revision: partial
-
Referee: Sections detailing storage and interconnect adaptations: The manuscript states that bottlenecks were overcome and interconnects stabilized to enable training, but provides no before/after benchmarks, ablation results, or failure-mode analysis to demonstrate that these changes were sufficient and that no major undisclosed shortfalls occurred.
Authors: We recognize that before/after benchmarks and ablation studies would strengthen the demonstration of sufficiency. The adaptations were implemented iteratively on a live, shared production system, which precluded controlled experimental runs or systematic pre/post measurements. In the revised manuscript, we have expanded the storage and interconnect sections with more precise descriptions of the observed bottlenecks, the exact configuration changes applied, and the qualitative improvements in training continuity that followed. We have also added a short failure-mode discussion outlining the types of interconnect and I/O issues encountered and how the mitigations addressed them. Full quantitative ablations remain outside the scope of what was feasible during the deployment. revision: partial
- Not provided in the revision: systematic before/after benchmarks and ablation results for the storage and interconnect changes, as these were applied incrementally in a production environment without dedicated test allocations.
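The restart rate and effective-compute figures promised in the rebuttal can be tied together with a simple goodput model: with periodic checkpointing, each failure costs on average half a checkpoint interval of lost progress plus a fixed restart overhead. A minimal sketch; all numbers are hypothetical, not from the paper:

```python
# Rough goodput model for a long training campaign: with a failure every
# MTBF hours and checkpoints every C hours, each failure loses on average
# C/2 hours of progress plus a fixed restart cost R. Numbers hypothetical.

def goodput(wall_clock_h: float, mtbf_h: float,
            ckpt_interval_h: float, restart_cost_h: float) -> float:
    """Fraction of wall-clock time spent making useful training progress."""
    failures = wall_clock_h / mtbf_h
    lost = failures * (ckpt_interval_h / 2.0 + restart_cost_h)
    return max(0.0, (wall_clock_h - lost) / wall_clock_h)

# e.g. a 90-day campaign, one transient failure per 24 h, hourly
# checkpoints, and 15 minutes to detect the failure and restart:
g = goodput(wall_clock_h=90 * 24, mtbf_h=24.0,
            ckpt_interval_h=1.0, restart_cost_h=0.25)
print(f"useful fraction of wall clock: {g:.1%}")
```

The model also makes the design trade-off explicit: shortening the checkpoint interval reduces work lost per failure but adds checkpoint-write overhead, which is exactly where the paper's storage adaptations matter.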
Circularity Check
No circularity: purely descriptive engineering narrative
full rationale
The manuscript is an engineering report describing infrastructure adaptations, challenges (storage bottlenecks, interconnect stabilization), and the process of training a 70B model on the Alps supercomputer. It contains no equations, no derivations, no fitted parameters, no performance predictions, and no self-referential claims that reduce to inputs by construction. The central claim of successful deployment is a direct factual assertion from the authors' experience, not derived from any prior result or fit within the paper. None of the six enumerated circularity patterns apply; the text is self-contained as a qualitative account with no load-bearing mathematical or predictive steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010
2017
-
[2]
AI for science: An emerging agenda,
P. Berens, K. Cranmer, N. D. Lawrence, U. von Luxburg, and J. Montgomery, “AI for science: An emerging agenda,” arXiv preprint arXiv:2303.04217, 2023. [Online]. Available: https://arxiv.org/abs/2303.04217
-
[3]
Scientific discovery in the age of artificial intelligence,
H. Wang, T. Fu, Y. Du et al., “Scientific discovery in the age of artificial intelligence,” Nature, vol. 620, no. 7972, pp. 47–60, 2023
2023
-
[4]
The impact of large language models on scientific discovery: a preliminary study using GPT-4,
Microsoft Research AI4Science and Microsoft Quantum, “The impact of large language models on scientific discovery: a preliminary study using GPT-4,” arXiv, vol. abs/2311.07361, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265150648
-
[5]
Scaling Laws for Neural Language Models
J. Kaplan et al., “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020
-
[6]
Training compute-optimal large language models,
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre, “Training compute-optimal large language models,” in Proceedings of the 36th ...
2022
-
[7]
AI datacenter energy dilemma - race for 100k clusters,
D. Patel and D. d’Obrenan, “AI datacenter energy dilemma - race for 100k clusters,” SemiAnalysis, 2024, accessed: 2025-05-
2024
-
[8]
[Online]. Available: https://www.semianalysis.com/p/ai-datacenter-energy-dilemma-race
-
[9]
OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
-
[10]
AI at Meta, “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
-
[11]
Alps, a versatile research infrastructure,
M. Martinasso, M. Klein, and T. Schulthess, “Alps, a versatile research infrastructure,” in Proceedings of the Cray User Group, ser. CUG ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 156–165. [Online]. Available: https://doi.org/10.1145/3757348.3757365
-
[12]
Apertus: Democratizing open and compliant LLMs for global language environments,
A. Hernandez-Cano et al., “Apertus: Democratizing open and compliant LLMs for global language environments,” arXiv preprint arXiv:2509.14233, 2025
-
[13]
Versatile software-defined cluster for HPC using cloud abstractions,
M. Martinasso, M. Klein, B. Cumming, M. Gila, F. A. Cruz, A. Madonna, M. S. Ballesteros, S. R. Alam, and T. C. Schulthess, “Versatile software-defined cluster for HPC using cloud abstractions,” Comput. Sci. Eng., vol. 26, no. 3, pp. 20–29, 2024. [Online]. Available: https://doi.org/10.1109/MCSE.2024.3394164
-
[14]
Sarus: Highly scalable docker containers for HPC systems,
L. Benedicic, F. A. Cruz, A. Madonna, and K. Mariotti, “Sarus: Highly scalable docker containers for HPC systems,” in High Performance Computing, M. Weiland, G. Juckeland, S. Alam, and H. Jagode, Eds. Cham: Springer International Publishing, 2019, pp. 46–60
2019
-
[15]
Firecrest v2: lessons learned from redesigning an API for scalable HPC resource access,
E. Palme, J. P. Dorsch, A. Khosravi, G. Pizzi, F. Pagnamenta, A. Ceriani, E. Koutsaniti, R. Sarmiento, I. Bonesana, and A. Dabin, “Firecrest v2: lessons learned from redesigning an API for scalable HPC resource access,”arXiv preprint arXiv:2512.11634, 2025. [Online]. Available: https://arxiv.org/abs/2512.11634
-
[16]
(2025) Container engine
Swiss National Supercomputing Centre (CSCS). (2025) Container engine. CSCS Documentation. Accessed: 2025-12-11. [Online]. Available: https://docs.cscs.ch/software/container-engine
2025
-
[17]
NGC catalog,
NVIDIA Corporation, “NGC catalog,” https://catalog.ngc.nvidia.com/, 2025, accessed: 2025-12-19
2025
-
[18]
Deriving activation functions using integration,
A. H. Huang and I. Schlag, “Deriving activation functions using integration,” 2025. [Online]. Available: https://arxiv.org/abs/2411.13010
-
[19]
J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M....
-
[20]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” 2020. [Online]. Available: https://arxiv.org/abs/1909.08053
-
[21]
(2025) swiss-ai: GitHub Organization
swiss-ai. (2025) swiss-ai: GitHub Organization. Accessed: 2025-12-20. [Online]. Available: https://github.com/swiss-ai
2025
-
[22]
libfabric pull request #11079: Fix the use of alt read rget restricted tc type,
OFI Working Group, “libfabric pull request #11079: Fix the use of alt read rget restricted tc type,” https://github.com/ofiwg/libfabric/pull/11079, 2025, GitHub pull request
2025
-
[23]
aws-ofi-nccl: A plugin to enable libfabric as a network provider for NCCL applications,
Amazon Web Services (AWS), “aws-ofi-nccl: A plugin to enable libfabric as a network provider for NCCL applications,” https://github.com/aws/aws-ofi-nccl, 2025, accessed: 2025-12-19
2025
-
[24]
(2025) Libfabric: OpenFabrics Interfaces (OFI) Documentation
OpenFabrics Interfaces Working Group. (2025) Libfabric: OpenFabrics Interfaces (OFI) Documentation. Accessed: 2025-12-20. [Online]. Available: https://ofiwg.github.io/libfabric/
2025
-
[25]
[PATCH] mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove(),
A. Popple, “[PATCH] mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove(),” https://lore.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com/, Apr 2022, kernel patch emailed to LKML; fixes race in mmu_interval_notifier_remove()
-
[26]
(2025) Transparent Hugepage Support – Memory Management (admin-guide/mm/transhuge)
The Linux Kernel Documentation Project. (2025) Transparent Hugepage Support – Memory Management (admin-guide/mm/transhuge). Accessed: 2025-12-20. [Online]. Available: https://docs.kernel.org/admin-guide/mm/transhuge.html
2025
-
[27]
NVIDIA Grace Hopper Superchip Architecture In-Depth
J. Evans, M. Andersch, V. Sethi, G. Brito, and V. Mehta. (2022, Nov.) NVIDIA Grace Hopper Superchip Architecture In-Depth. NVIDIA. Accessed: 2025-12-20. [Online]. Available: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/
2022
-
[28]
Understanding data movement in tightly coupled heterogeneous systems: A case study with the grace hopper superchip,
L. Fusco, M. Khalilov, M. Chrapek, G. Chukkapalli, T. Schulthess, and T. Hoefler, “Understanding data movement in tightly coupled heterogeneous systems: A case study with the grace hopper superchip,”
-
[29]
[Online]. Available: https://arxiv.org/abs/2408.11556
-
[30]
6x faster Async Checkpointing in PyTorch, using Cached Plans, no GIL contention,
L. Wright, M. Vadakkanchery, S. Mishra, E. Krepska, H. Shojanazeri, P. Fernando, E. Petersen, M. Cala, and C. Smith. (2025, Apr.) 6x faster Async Checkpointing in PyTorch, using Cached Plans, no GIL contention. PyTorch Foundation. Accessed: 2025-12-20. [Online]. Available: https://pytorch.org/blog/6x-faster-async-checkpointing/
2025
-
[31]
D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, “Efficient large-scale language model training on gpu clusters using megatron-lm,” 2021. [Online]. Available: https://arxiv.org/abs/2104.04473
-
[32]
Evolving HPC services to enable ML workloads on HPE Cray EX,
S. Schuppli, F. Mohamed, H. Mendonca, N. Mujkanovic, E. Palme, D. Conciatore, L. Drescher, M. Gila, P. Witlox, J. VandeVondele, M. Martinasso, T. C. Schulthess, and T. Hoefler, “Evolving hpc services to enable ml workloads on hpe cray ex,” in Proceedings of the Cray User Group, ser. CUG ’25. New York, NY, USA: Association for Computing Machinery, 2025, p...