An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
Pith reviewed 2026-05-10 14:11 UTC · model grok-4.3
The pith
A public European supercomputer completes pre-training of a 70B open multilingual LLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a massive pre-training campaign for the 70B-parameter Apertus model succeeded on one of Europe's largest open-science supercomputers, and that targeted fixes to the storage systems and interconnect stability are what made training reliable at this scale. On this account, the effort transforms standard HPC infrastructure into a dependable platform for large language model development; the authors present it as a first-of-its-kind academic achievement and as the foundation for sustained, iterative operations, in particular fine-tuning of foundation models.
What carries the argument
Storage adaptations and interconnect stabilization that convert an HPC supercomputer into a resilient, software-defined machine learning platform.
Load-bearing premise
The described changes to storage and interconnects proved sufficient for completing stable training without major undisclosed performance shortfalls or failures.
What would settle it
An independent run or audit that reproduces the full pre-training campaign and measures actual uptime, throughput, and completion rates against the reported outcomes.
Figures
read the original abstract
Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine tuning foundation models, that extends well beyond a single model training run.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the engineering process of training Apertus, a 70B-parameter open multilingual LLM, on the Alps supercomputer using NVIDIA GH200 Grace Hopper Superchips. It details challenges in adapting HPC infrastructure—including storage bottlenecks and large-scale interconnect stabilization—along with lessons for transforming the system into a resilient software-defined ML platform, and outlines post-training requirements for iterative fine-tuning. The central claim is that this represents a first-of-its-kind successful academic deployment at the 70B scale for open science.
Significance. If the infrastructure adaptations enabled stable training as described, the work offers practical value by documenting real-world HPC-to-ML transitions on a major European public system, supporting sovereign AI efforts outside private-sector dominance. The narrative of challenges and platform evolution provides transferable lessons for similar large-scale training campaigns, though the lack of supporting performance data reduces its utility as a reproducible reference.
major comments (2)
- Abstract and main narrative on the pre-training campaign: The assertion of successful stable 70B-scale training is presented without any quantitative metrics (e.g., achieved tokens/sec, sustained TFLOPS utilization, restart/failure counts, or effective vs. theoretical compute delivered). This leaves the claims of 'resilient platform' and 'successful deployment' as qualitative descriptions whose effectiveness cannot be evaluated, directly undermining the load-bearing assertion of first-of-its-kind academic achievement.
- Sections detailing storage and interconnect adaptations: The manuscript states that bottlenecks were overcome and interconnects stabilized to enable training, but provides no before/after benchmarks, ablation results, or failure-mode analysis to demonstrate that these changes were sufficient and that no major undisclosed shortfalls occurred.
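The utilization figure the referee asks for is conventionally reported as model FLOPs utilization (MFU), computed from sustained token throughput via the standard ~6·N·D approximation for transformer training FLOPs per token. A minimal sketch of that calculation; every number below is hypothetical and illustrative, not taken from the paper:

```python
# Estimate model FLOPs utilization (MFU) from sustained token throughput,
# using the common ~6 * N FLOPs-per-token approximation for transformer
# training (forward + backward). All inputs below are illustrative only.

def mfu(params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Fraction of theoretical peak compute actually delivered."""
    achieved = 6.0 * params * tokens_per_sec   # sustained training FLOP/s
    peak = n_gpus * peak_flops_per_gpu         # aggregate theoretical FLOP/s
    return achieved / peak

# Hypothetical campaign: a 70B-parameter model on 4096 GPUs, each with
# ~1e15 FLOP/s peak, sustaining 1.2e6 tokens/s across the job.
u = mfu(params=70e9, tokens_per_sec=1.2e6, n_gpus=4096,
        peak_flops_per_gpu=1e15)
print(f"estimated MFU: {u:.1%}")
```

Reporting even one such number alongside wall-clock progress would let readers compare the run against other published large-scale campaigns.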
minor comments (1)
- The abstract and text would benefit from explicit cross-references to any accompanying figures or tables that might illustrate the platform architecture or timeline of adaptations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key opportunities to strengthen the evidential basis of our claims. We have revised the manuscript to incorporate additional operational details and clarifications while preserving the paper's focus as an engineering case study on HPC-to-ML platform adaptation.
read point-by-point responses
-
Referee: Abstract and main narrative on the pre-training campaign: The assertion of successful stable 70B-scale training is presented without any quantitative metrics (e.g., achieved tokens/sec, sustained TFLOPS utilization, restart/failure counts, or effective vs. theoretical compute delivered). This leaves the claims of 'resilient platform' and 'successful deployment' as qualitative descriptions whose effectiveness cannot be evaluated, directly undermining the load-bearing assertion of first-of-its-kind academic achievement.
Authors: We agree that quantitative metrics would allow readers to better evaluate the stability and scale of the training run. The manuscript prioritizes documenting the engineering challenges and platform evolution over performance benchmarking, as these aspects represent the primary contribution for academic and public-sector efforts. In the revised version, we have added a summary of key campaign statistics in Section 3, including approximate sustained throughput in tokens per second, total tokens processed, and the observed restart rate due to transient failures. We also include a high-level estimate of effective compute delivery relative to theoretical peak based on hardware specifications and wall-clock progress. These additions provide concrete anchors for the 'successful deployment' claim without converting the paper into a benchmarking study. revision: partial
-
Referee: Sections detailing storage and interconnect adaptations: The manuscript states that bottlenecks were overcome and interconnects stabilized to enable training, but provides no before/after benchmarks, ablation results, or failure-mode analysis to demonstrate that these changes were sufficient and that no major undisclosed shortfalls occurred.
Authors: We recognize that before/after benchmarks and ablation studies would strengthen the demonstration of sufficiency. The adaptations were implemented iteratively on a live, shared production system, which precluded controlled experimental runs or systematic pre/post measurements. In the revised manuscript, we have expanded the storage and interconnect sections with more precise descriptions of the observed bottlenecks, the exact configuration changes applied, and the qualitative improvements in training continuity that followed. We have also added a short failure-mode discussion outlining the types of interconnect and I/O issues encountered and how the mitigations addressed them. Full quantitative ablations remain outside the scope of what was feasible during the deployment. revision: partial
- Not provided in the revision: systematic before/after benchmarks and ablation results for the storage and interconnect changes, as these were applied incrementally in a production environment without dedicated test allocations.
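The restart rate and effective-compute figures promised in the rebuttal can be tied together with a simple goodput model: with periodic checkpointing, each failure costs on average half a checkpoint interval of lost progress plus a fixed restart overhead. A minimal sketch; all numbers are hypothetical, not from the paper:

```python
# Rough goodput model for a long training campaign: with a failure every
# MTBF hours and checkpoints every C hours, each failure loses on average
# C/2 hours of progress plus a fixed restart cost R. Numbers hypothetical.

def goodput(wall_clock_h: float, mtbf_h: float,
            ckpt_interval_h: float, restart_cost_h: float) -> float:
    """Fraction of wall-clock time spent making useful training progress."""
    failures = wall_clock_h / mtbf_h
    lost = failures * (ckpt_interval_h / 2.0 + restart_cost_h)
    return max(0.0, (wall_clock_h - lost) / wall_clock_h)

# e.g. a 90-day campaign, one transient failure per 24 h, hourly
# checkpoints, and 15 minutes to detect the failure and restart:
g = goodput(wall_clock_h=90 * 24, mtbf_h=24.0,
            ckpt_interval_h=1.0, restart_cost_h=0.25)
print(f"useful fraction of wall clock: {g:.1%}")
```

The model also makes the design trade-off explicit: shortening the checkpoint interval reduces work lost per failure but adds checkpoint-write overhead, which is exactly where the paper's storage adaptations matter.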
Circularity Check
No circularity: purely descriptive engineering narrative
full rationale
The manuscript is an engineering report describing infrastructure adaptations, challenges (storage bottlenecks, interconnect stabilization), and the process of training a 70B model on the Alps supercomputer. It contains no equations, no derivations, no fitted parameters, no performance predictions, and no self-referential claims that reduce to inputs by construction. The central claim of successful deployment is a direct factual assertion from the authors' experience, not derived from any prior result or fit within the paper. None of the six enumerated circularity patterns apply; the text is self-contained as a qualitative account with no load-bearing mathematical or predictive steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010
2017
-
[2]
AI for science: An emerging agenda,
P. Berens, K. Cranmer, N. D. Lawrence, U. von Luxburg, and J. Montgomery, “AI for science: An emerging agenda,” arXiv preprint arXiv:2303.04217, 2023. [Online]. Available: https://arxiv.org/abs/2303.04217
-
[3]
Scientific discovery in the age of artificial intelligence,
H. Wang, T. Fu, Y. Du et al., “Scientific discovery in the age of artificial intelligence,” Nature, vol. 620, no. 7972, pp. 47–60, 2023
2023
-
[4]
The impact of large language models on scientific discovery: a preliminary study using GPT-4,
Microsoft Research AI4Science and Microsoft Quantum, “The impact of large language models on scientific discovery: a preliminary study using GPT-4,” arXiv, vol. abs/2311.07361, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265150648
-
[5]
Scaling Laws for Neural Language Models
J. Kaplan et al., “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020
-
[6]
Training compute-optimal large language models,
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre, “Training compute-optimal large language models,” in Proceedings of the 36th ...
2022
-
[7]
AI datacenter energy dilemma - race for 100k clusters,
D. Patel and D. d’Obrenan, “AI datacenter energy dilemma - race for 100k clusters,” SemiAnalysis, 2024, accessed: 2025-05-
2024
-
[8]
[Online]. Available: https://www.semianalysis.com/p/ai-datacenter-energy-dilemma-race
-
[9]
OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
-
[10]
AI at Meta, “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
-
[11]
Alps, a versatile research infrastructure,
M. Martinasso, M. Klein, and T. Schulthess, “Alps, a versatile research infrastructure,” in Proceedings of the Cray User Group, ser. CUG ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 156–165. [Online]. Available: https://doi.org/10.1145/3757348.3757365
-
[12]
Apertus: Democratizing open and compliant LLMs for global language environments,
A. Hernandez-Cano et al., “Apertus: Democratizing open and compliant LLMs for global language environments,” arXiv preprint arXiv:2509.14233, 2025
-
[13]
Versatile software-defined cluster for HPC using cloud abstractions,
M. Martinasso, M. Klein, B. Cumming, M. Gila, F. A. Cruz, A. Madonna, M. S. Ballesteros, S. R. Alam, and T. C. Schulthess, “Versatile software-defined cluster for HPC using cloud abstractions,” Comput. Sci. Eng., vol. 26, no. 3, pp. 20–29, 2024. [Online]. Available: https://doi.org/10.1109/MCSE.2024.3394164
-
[14]
Sarus: Highly scalable docker containers for HPC systems,
L. Benedicic, F. A. Cruz, A. Madonna, and K. Mariotti, “Sarus: Highly scalable docker containers for HPC systems,” in High Performance Computing, M. Weiland, G. Juckeland, S. Alam, and H. Jagode, Eds. Cham: Springer International Publishing, 2019, pp. 46–60
2019
-
[15]
Firecrest v2: lessons learned from redesigning an API for scalable HPC resource access,
E. Palme, J. P. Dorsch, A. Khosravi, G. Pizzi, F. Pagnamenta, A. Ceriani, E. Koutsaniti, R. Sarmiento, I. Bonesana, and A. Dabin, “Firecrest v2: lessons learned from redesigning an API for scalable HPC resource access,”arXiv preprint arXiv:2512.11634, 2025. [Online]. Available: https://arxiv.org/abs/2512.11634
-
[16]
(2025) Container engine
Swiss National Supercomputing Centre (CSCS). (2025) Container engine. CSCS Documentation. Accessed: 2025-12-11. [Online]. Available: https://docs.cscs.ch/software/container-engine
2025
-
[17]
NGC catalog,
NVIDIA Corporation, “NGC catalog,” https://catalog.ngc.nvidia.com/, 2025, accessed: 2025-12-19
2025
-
[18]
Deriving activation functions using integration,
A. H. Huang and I. Schlag, “Deriving activation functions using integration,” 2025. [Online]. Available: https://arxiv.org/abs/2411.13010
-
[19]
J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M....
-
[20]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” 2020. [Online]. Available: https://arxiv.org/abs/1909.08053
-
[21]
(2025) swiss-ai: GitHub Organization
swiss-ai. (2025) swiss-ai: GitHub Organization. Accessed: 2025-12-20. [Online]. Available: https://github.com/swiss-ai
2025
-
[22]
libfabric pull request #11079: Fix the use of alt read rget restricted tc type,
OFI Working Group, “libfabric pull request #11079: Fix the use of alt read rget restricted tc type,” https://github.com/ofiwg/libfabric/pull/11079, 2025, GitHub pull request
2025
-
[23]
aws-ofi-nccl: A plugin to enable libfabric as a network provider for NCCL applications,
Amazon Web Services (AWS), “aws-ofi-nccl: A plugin to enable libfabric as a network provider for NCCL applications,” https://github.com/aws/aws-ofi-nccl, 2025, accessed: 2025-12-19
2025
-
[24]
(2025) Libfabric: OpenFabrics Interfaces (OFI) Documentation
OpenFabrics Interfaces Working Group. (2025) Libfabric: OpenFabrics Interfaces (OFI) Documentation. Accessed: 2025-12-20. [Online]. Available: https://ofiwg.github.io/libfabric/
2025
-
[25]
[PATCH] mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove(),
A. Popple, “[PATCH] mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove(),” https://lore.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com/, Apr 2022, kernel patch emailed to LKML; fixes race in mmu_interval_notifier_remove()
-
[26]
(2025) Transparent Hugepage Support – Memory Management (admin-guide/mm/transhuge)
The Linux Kernel Documentation Project. (2025) Transparent Hugepage Support – Memory Management (admin-guide/mm/transhuge). Accessed: 2025-12-20. [Online]. Available: https://docs.kernel.org/admin-guide/mm/transhuge.html
2025
-
[27]
NVIDIA Grace Hopper Superchip Architecture In-Depth
J. Evans, M. Andersch, V. Sethi, G. Brito, and V. Mehta. (2022, Nov.) NVIDIA Grace Hopper Superchip Architecture In-Depth. NVIDIA. Accessed: 2025-12-20. [Online]. Available: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/
2022
-
[28]
Understanding data movement in tightly coupled heterogeneous systems: A case study with the grace hopper superchip,
L. Fusco, M. Khalilov, M. Chrapek, G. Chukkapalli, T. Schulthess, and T. Hoefler, “Understanding data movement in tightly coupled heterogeneous systems: A case study with the grace hopper superchip,”
-
[29]
[Online]. Available: https://arxiv.org/abs/2408.11556
-
[30]
6x faster Async Checkpointing in PyTorch, using Cached Plans, no GIL contention,
L. Wright, M. Vadakkanchery, S. Mishra, E. Krepska, H. Shojanazeri, P. Fernando, E. Petersen, M. Cala, and C. Smith. (2025, Apr.) 6x faster Async Checkpointing in PyTorch, using Cached Plans, no GIL contention. PyTorch Foundation. Accessed: 2025-12-20. [Online]. Available: https://pytorch.org/blog/6x-faster-async-checkpointing/
2025
-
[31]
D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, “Efficient large-scale language model training on gpu clusters using megatron-lm,” 2021. [Online]. Available: https://arxiv.org/abs/2104.04473
-
[32]
Evolving HPC services to enable ML workloads on HPE Cray EX,
S. Schuppli, F. Mohamed, H. Mendonca, N. Mujkanovic, E. Palme, D. Conciatore, L. Drescher, M. Gila, P. Witlox, J. VandeVondele, M. Martinasso, T. C. Schulthess, and T. Hoefler, “Evolving hpc services to enable ml workloads on hpe cray ex,” in Proceedings of the Cray User Group, ser. CUG ’25. New York, NY, USA: Association for Computing Machinery, 2025, p...