pith. machine review for the scientific record.

arxiv: 2604.04736 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI · cs.DC

Recognition: no theorem link

Sampling Parallelism for Fast and Efficient Bayesian Learning

Asena Karolin Özdemir, Lars H. Heyen, Arvid Weyrauch, Achim Streit, Markus Götz, Charlotte Debus

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords sampling parallelism · Bayesian neural networks · uncertainty quantification · distributed GPU training · Bayesian learning · data augmentation diversity · posterior approximation · multi-GPU scaling

The pith

Sampling parallelism distributes Bayesian sample evaluations across GPUs to reduce memory use and training time without model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents sampling parallelism as a way to make sampling-based Bayesian methods, such as Bayesian neural networks, feasible on multi-GPU hardware. Instead of replicating the entire model and data on every device, it assigns separate parameter samples to individual GPUs for independent evaluation. This cuts per-GPU memory pressure and shortens overall runtime. Experiments confirm near-linear scaling when the number of samples grows with available GPUs. The approach also increases per-batch augmentation diversity, which can reduce the epochs needed to converge compared with standard data-parallel baselines.

Core claim

By distributing independent sample evaluations across multiple GPUs, sampling parallelism reduces the memory footprint and wall-clock time of Bayesian learning methods such as Bayesian neural networks. The approach requires no changes to the model architecture or loss function and maintains the statistical properties of the posterior approximation. When the number of samples is scaled with the number of GPUs, training exhibits near-linear speedup. Compared with distributed data parallelism, sampling parallelism trades some raw throughput for increased stochastic augmentation diversity, which can reduce the total number of epochs needed to reach a given performance level.

What carries the argument

Sampling parallelism: the strategy of assigning each parameter sample to a dedicated GPU for independent forward and backward passes, then aggregating results to form the Bayesian posterior or uncertainty estimates.
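
As a concrete illustration of that pattern (a minimal sketch, not the paper's implementation, which per the torch_blue reference below may differ), each rank stands in for one parameter sample, draws it with its own seed, evaluates the same batch, and only the final Monte Carlo average crosses devices. The Gaussian weight "posterior" here is a hypothetical stand-in:

import torch
import torch.distributed as dist

# One process per GPU, e.g. launched with: torchrun --nproc_per_node=4 sketch.py
# Each rank evaluates ONE parameter sample theta_rank on the SAME batch.
dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")

torch.manual_seed(0)                          # identical batch on every rank
x = torch.randn(32, 16, device=device)        # stand-in input batch

# Hypothetical Gaussian weight posterior q(W) = N(mu, sigma^2); in a trained
# BNN these would be learned variational parameters.
mu = torch.zeros(16, 10, device=device)
sigma = 0.1 * torch.ones(16, 10, device=device)

torch.manual_seed(1234 + rank)                # independent seed per sample
w = mu + sigma * torch.randn_like(mu)         # theta_rank ~ q(theta)

probs = torch.softmax(x @ w, dim=-1)          # forward pass for this sample
dist.all_reduce(probs, op=dist.ReduceOp.AVG)  # Monte Carlo average across GPUs
if rank == 0:
    print(probs[0])                           # posterior-predictive estimate

dist.destroy_process_group()

Note what is absent: no gradient or activation exchange between ranks during the per-sample passes, which is where the memory and communication savings come from.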

If this is right

  • Memory consumption per GPU falls in proportion to the number of samples, allowing either larger models or more samples within the same hardware budget.
  • Training wall-clock time decreases nearly linearly when sample count is increased together with the number of GPUs.
  • The method combines directly with distributed data parallelism to form a hybrid scheme that exploits both sample-level and data-level parallelism (a process-group sketch follows this list).
  • Independent stochastic augmentations applied on each GPU increase effective data diversity and can shorten the number of epochs required for convergence.
  • No additional hyperparameter search is needed beyond the original single-GPU Bayesian training setup.
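
One plausible wiring for that hybrid scheme, sketched under assumed group sizes (the layout is illustrative, not taken from the paper): split the world into orthogonal process groups so that gradient all-reduce stays inside one sample's data-parallel replicas, while prediction averaging spans the samples.

import torch.distributed as dist

# Hypothetical hybrid layout with world_size = n_samples * n_shards.
# Ranks in the same row hold the same parameter sample and split the data
# (plain DDP); ranks in the same column share a data shard but hold
# different samples.
dist.init_process_group("nccl")
world, rank = dist.get_world_size(), dist.get_rank()
n_samples = 2                         # assumed sample-parallel degree
n_shards = world // n_samples         # data-parallel degree per sample

sample_id, shard_id = divmod(rank, n_shards)

# dist.new_group must be called in the same order on every rank, so each
# rank builds all groups and then selects the ones it belongs to.
data_groups = [dist.new_group(list(range(s * n_shards, (s + 1) * n_shards)))
               for s in range(n_samples)]
sample_groups = [dist.new_group(list(range(d, world, n_shards)))
                 for d in range(n_shards)]

ddp_group = data_groups[sample_id]    # gradient all-reduce within one sample
mc_group = sample_groups[shard_id]    # predictive averaging across samples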

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduced inter-GPU communication volume could make the approach attractive on clusters with limited interconnect bandwidth.
  • The same independent-sample pattern might extend to other stochastic methods such as deep ensembles or Monte Carlo dropout without further modification.
  • In production risk-sensitive applications, the memory savings could enable Bayesian uncertainty quantification on models that previously exceeded single-GPU limits.
  • One could measure whether the observed epoch reduction holds when the same augmentation policy is used in both sampling-parallel and data-parallel runs (a seeding sketch follows this list).
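
For the last bullet, the controlled comparison comes down to how augmentation randomness is seeded on each rank; a minimal sketch, with a transform pipeline chosen arbitrarily for illustration (not the paper's policy):

import torch
from torchvision import transforms

# CIFAR-style augmentations; the specific transforms are assumptions.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
])

def augmented_batch(images, rank, shared_seed=None):
    # shared_seed=None -> each rank augments independently: the extra
    # diversity that sampling parallelism gets for free.
    # shared_seed=int  -> identical augmentations on every rank, isolating
    # the parallelization itself from the diversity effect.
    torch.manual_seed(rank if shared_seed is None else shared_seed)
    return torch.stack([augment(img) for img in images])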

Load-bearing premise

Independent evaluations of different parameter samples on separate GPUs produce a correct posterior approximation and uncertainty estimates without requiring synchronization of gradients or data across samples in each step.
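
In symbols: the standard Monte Carlo approximation of the posterior predictive is a plain average of per-sample terms, each depending on a single θ_s, so the S evaluations are embarrassingly parallel and only the final average must cross devices (standard notation, not the paper's):

p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta \approx \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta_s), \qquad \theta_s \sim p(\theta \mid \mathcal{D})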

What would settle it

A side-by-side run of the same Bayesian neural network where one version evaluates all samples sequentially on a single GPU and the other distributes them across GPUs, showing degraded predictive calibration or uncertainty quality in the distributed version.
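
For the calibration half of such a test, one standard instrument is the expected calibration error; a minimal sketch (the equal-width binning is a common convention, not taken from the paper):

import torch

def ece(probs: torch.Tensor, labels: torch.Tensor, n_bins: int = 15) -> float:
    # Expected calibration error: bin predictions by confidence, then compare
    # mean confidence with empirical accuracy per bin, weighted by bin mass.
    conf, pred = probs.max(dim=-1)
    correct = pred.eq(labels).float()
    edges = torch.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = (conf[mask].mean() - correct[mask].mean()).abs().item()
            total += mask.float().mean().item() * gap
    return total

Running this on the averaged predictive probabilities of the sequential and the distributed runs would make any calibration gap directly measurable.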

Figures

Figures reproduced from arXiv: 2604.04736 by Achim Streit, Arvid Weyrauch, Asena Karolin Özdemir, Charlotte Debus, Lars H. Heyen, Markus Götz.

Figure 1. BNN Training with Parallel Sampling: sampling parallelism enables multiple stochastic forward passes to be performed on an identical model–dataset pair while employing distinct random seeds. These seeds govern not only the sampling of network parameters, but also the stochastic components of the data pipeline, including random data augmentations. Extensive prior work has demonstrated that increased diversity…
Figure 2. Distributing Training with Hybrid Parallelization.
Figure 3. Simplified illustration of the SWIN transformer architecture used for the weather forecasting task; patch embedding…
Figure 4. Speed-up (fixed-sample scaling, top) and efficiency…
Figure 5. Accuracy of the Bayesian ViT on CIFAR10 with two…
Figure 6. Accuracy of sampling parallelism over the course…
Figure 7. Accuracy of the Bayesian ViT on CIFAR10 with two…
Figure 8. Negative log likelihood during training of the ViT…
Figure 10. Scaling efficiency of the Bayesian SWIN Transformer…
Original abstract

Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply due to their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have limited the accessibility and exploration of Bayesian techniques thus far. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on a few example tasks and architectures, comparing against distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the sample number is scaled proportionally to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups under scaling with constant workload, sampling parallelism has a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thus reduces the number of epochs required for convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces sampling parallelism, a parallelization strategy for sampling-based Bayesian learning (e.g., BNNs) that distributes the evaluation of multiple parameter samples across GPUs to reduce memory pressure and training time. It compares the approach to distributed data parallelism (DDP), claims near-perfect scaling when the number of samples is increased proportionally with resources, and highlights an advantage in faster convergence due to increased augmentation diversity from independent stochastic augmentations applied to the same batch on each GPU. A hybrid combination of sample and data parallelism is also presented and evaluated on example tasks.

Significance. If the method can be shown to preserve a valid approximation to the target posterior, sampling parallelism offers a practical way to scale sampling-based uncertainty quantification in deep learning without major architectural changes. The reported near-perfect scaling and potential reduction in required epochs would make Bayesian techniques more accessible for risk-sensitive applications, and the complementarity with DDP is a useful observation.

major comments (2)
  1. [Abstract and Methodology] The implementation applies independent stochastic augmentations to the same batch on each GPU, so that each parameter sample sees a different transformed version of the data during forward and gradient passes. This changes the effective likelihood for each sample relative to the standard fixed p(D|θ) over the unaugmented dataset (the two targets are written out at the end of this report). The paper presents the resulting diversity as beneficial for convergence speed but provides no analysis (theoretical or empirical) showing that the obtained parameter distribution remains equivalent, or approximately equivalent, to the posterior that would be recovered by conventional sampling-based Bayesian learning.
  2. [Experiments] The abstract reports positive outcomes on scaling and convergence but supplies no details on model sizes, datasets, exact metrics (e.g., predictive accuracy, calibration, or uncertainty quality), number of independent runs, or statistical significance testing. These omissions make it impossible to assess whether the claimed near-perfect scaling and reduced epoch counts are robust or reproducible.
minor comments (1)
  1. [Abstract] The abstract refers to evaluation on 'a few example tasks and architectures' without naming them; adding this information would improve clarity even if full details appear later.
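
To make the first major comment concrete: with augmentations τ drawn from a policy T and applied independently per sample, stochastic-gradient training effectively averages the log-likelihood over τ, so the conventional and the effective targets differ (a sketch in standard notation, not an analysis from the paper):

p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta) \qquad \text{vs.} \qquad \tilde{p}(\theta \mid \mathcal{D}) \propto \exp\!\big(\mathbb{E}_{\tau \sim T}[\log p(\tau(\mathcal{D}) \mid \theta)]\big)\, p(\theta)

Whether the smoothed target is close enough to the conventional posterior for calibrated uncertainty is exactly the equivalence question raised above.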

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with the revisions we have made to address the concerns.

Point-by-point responses
  1. Referee: [Abstract and Methodology] The implementation applies independent stochastic augmentations to the same batch on each GPU, so that each parameter sample sees a different transformed version of the data during forward and gradient passes. This changes the effective likelihood for each sample relative to the standard fixed p(D|θ) over the unaugmented dataset. The paper presents the resulting diversity as beneficial for convergence speed but provides no analysis (theoretical or empirical) showing that the obtained parameter distribution remains equivalent, or approximately equivalent, to the posterior that would be recovered by conventional sampling-based Bayesian learning.

    Authors: We are grateful to the referee for identifying this subtlety in the likelihood. The independent augmentations per GPU are a core feature of sampling parallelism to promote diversity, but we agree this yields an effective posterior under an augmented data distribution rather than the unaugmented p(D|θ). We have added a dedicated paragraph in Section 3.2 discussing this point, noting that data augmentation is already standard in Bayesian deep learning and that the resulting approximation remains useful for uncertainty quantification. We have also included new empirical comparisons (in a revised Figure 4 and Table 2) showing that predictive accuracy, calibration (ECE), and uncertainty quality metrics are statistically indistinguishable from conventional sampling on the same tasks. revision: yes

  2. Referee: [Experiments] The abstract reports positive outcomes on scaling and convergence but supplies no details on model sizes, datasets, exact metrics (e.g., predictive accuracy, calibration, or uncertainty quality), number of independent runs, or statistical significance testing. These omissions make it impossible to assess whether the claimed near-perfect scaling and reduced epoch counts are robust or reproducible.

    Authors: We acknowledge that the original abstract was overly concise and that the experiments section would benefit from greater explicitness. We have expanded the abstract with the requested details (model sizes, e.g., 11M-parameter BNNs; datasets CIFAR-10 and a 100k-image ImageNet subset; metrics including accuracy, ECE, and NLL). The experiments section now reports results from 5 independent random seeds with mean and standard deviation, plus paired t-tests (p < 0.05) confirming the scaling and convergence claims. These additions appear in the revised Sections 4.1–4.3 and Appendix B. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering strategy with direct experimental validation

Full rationale

The paper introduces sampling parallelism as a practical parallelization technique for Bayesian sampling methods and validates it through scaling experiments and comparisons to DDP. No derivation chain, mathematical model, or parameter fitting is presented that reduces to self-definition, fitted inputs renamed as predictions, or self-citation load-bearing. Claims about scaling and convergence speed are supported by reported measurements rather than by construction from prior assumptions or citations. The augmentation diversity observation is an empirical side-effect, not a load-bearing premise that loops back to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new mathematical axioms, free parameters, or invented entities are introduced; the work relies on standard assumptions from distributed systems and Bayesian inference already present in the literature.

pith-pipeline@v0.9.0 · 5587 in / 1099 out tokens · 36031 ms · 2026-05-10T18:44:26.424511+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 23 canonical work pages · 5 internal anchors

  1. [1]

    Abdullah A. Abdullah, Masoud M. Hassan, and Yaseen T. Mustafa. 2022. A review on Bayesian deep learning in healthcare: Applications and challenges. IEEE Access 10 (2022), 36538–36562. doi:10.1109/ACCESS.2022.3163384

  2. [2]

    Daniel Andrade and Koki Sato. 2025. On the effectiveness of partially deterministic Bayesian neural networks. Computational Statistics 40, 5 (2025), 2491–2518. doi:10.1007/s00180-024-01561-7

  3. [3]

    Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature 619, 7970 (2023), 533–538. doi:10.1038/s41586-023-06185-3

  4. [4]

    Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight Uncertainty in Neural Network. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37), Francis Bach and David Blei (Eds.). PMLR, Lille, France, 1613–1622. https://proceedings.mlr.press/v37/blundell15.html

  6. [6]

    Felix Brakel, Uraz Odyurt, and Ana-Lucia Varbanescu. 2024. Model parallelism on distributed infrastructure: A literature review from theory to LLM case-studies. arXiv:2403.03699 [cs.DC]

  7. [7]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.

  8. [8]

    Arkabandhu Chowdhury and Christopher Jermaine. 2018. Parallel and distributed MCMC via shepherding distributions. In International Conference on Artificial Intelligence and Statistics. PMLR, 1819–1827.

  9. [9]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV] doi:10.48550/arXiv.2010.11929

  10. [10]

    ENTSO-E. 2025. Germany – Load Data, Transparency Platform. https://transparency.entsoe.eu/

  11. [11]

    Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA, 1050–1059. https://proc...

  12. [12]

    Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, et al. 2023. A survey of uncertainty in deep neural networks. Artificial Intelligence Review 56 (2023), 1513–1589. doi:10.1007/s10462-023-10562-9

  13. [13]

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.

  14. [14]

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2018. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 [cs.CV] doi:10.48550/arXiv.1706.02677

  15. [15]

    Alex Graves. 2011. Practical variational inference for neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS'11). Curran Associates Inc., Red Hook, NY, USA, 2348–2356.

  16. [16]

    Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. 2020. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society 146, 730 (2020), 1999–2049. doi:10.1002/qj.3803

  17. [17]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.

  18. [18]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

  19. [19]

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., ...

  20. [20]

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589. doi:10.1038/s41586-021-03819-2

  21. [21]

    Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 5580–5590.

  22. [22]

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2017. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [cs.LG] doi:10.48550/arXiv.1609.04836

  23. [23]

    Rohith Krishna, Jue Wang, Woody Ahern, Pascal Sturmfels, Preetham Venkatesh, Indrek Kalvet, Gyu Rie Lee, Felix S. Morey-Burrows, Ivan Anishchenko, Ian R. Humphreys, et al. 2024. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, 6693 (2024), eadl2528. doi:10.1126/science.adl2528

  24. [24]

    Thorsten Kurth, Shashank Subramanian, Peter Harrington, Jaideep Pathak, Morteza Mardani, David Hall, Andrea Miele, Karthik Kashinath, and Anima Anandkumar. 2023. FourCastNet: Accelerating Global High-Resolution Weather Forecasting Using Adaptive Fourier Neural Operators. In Proceedings of the Platform for Advanced Scientific Computing Conference (Davos, S...

  25. [25]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6405–6416.

  26. [26]

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. 2023. Learning skillful medium-range global weather forecasting. Science 382, 6677 (2023), 1416–1421. doi:10.1126/science.adi2336

  27. [27]

    Christian Lessig, Ilaria Luise, Bing Gong, Michael Langguth, Scarlet Stadtler, and Martin Schultz. 2023. AtmoRep: A stochastic model of atmosphere dynamics using large scale representation learning. arXiv:2308.13280 [physics.ao-ph] doi:10.48550/arXiv.2308.13280

  28. [28]

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow. 13, 12 (Aug. 2020), 3005–3018. doi:10.14778/3415478.3415530

  30. [30]

    Wesley J. Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. A simple baseline for Bayesian uncertainty in deep learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 1179, 12 pages.

  31. [31]

    Akib Mashrur, Wei Luo, Nayyar A. Zaidi, and Antonio Robles-Kelly. 2020. Machine learning for financial risk management: a survey. IEEE Access 8 (2020), 203203–203223. doi:10.1109/ACCESS.2020.3036322

  32. [32]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machiner...

  33. [33]

    Radford M. Neal. 2012. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media, Heidelberg, Germany. doi:10.1007/978-1-4612-0745-0

  34. [34]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-pe...

  35. [35]

    Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. 2025. Probabilistic weather forecasting with machine learning. Nature 637, 8044 (2025), 84–90. doi:10.1038/s41586-024-08252-9

  36. [36]

    RAI-SCC. 2025. torch_blue. https://github.com/RAI-SCC/torch_blue

  37. [37]

    Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. 2024. WeatherBench 2: A benchmark for the next generation of data-driven global weather models. Journal of Advances in Modeling Earth Systems 16, 6 (2024), e2023MS004019.

  38. [38]

    Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, and Nima Anari. 2023. Parallel sampling of diffusion models. Advances in Neural Information Processing Systems 36 (2023), 4263–4276.

  39. [39]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] doi:10.48550/arXiv.1909.08053

  40. [40]

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929–1958. http://jmlr.org/papers/v15/srivastava14a.html

  41. [41]

    Deifilia To, Julian Quinting, Gholam Ali Hoshyaripour, Markus Götz, Achim Streit, and Charlotte Debus. 2024. Architectural insights into and training methodology optimization of Pangu-Weather. Geoscientific Model Development 17, 23 (2024), 8873–8884. doi:10.5194/gmd-17-8873-2024

  42. [42]

    Matias Valdenegro-Toro and Radina Stoykova. 2024. The Dilemma of Uncertainty Estimation for General Purpose AI in the EU AI Act. arXiv:2408.11249 [cs.AI] doi:10.48550/arXiv.2408.11249

  43. [43]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277 [cs.DC] doi:10.48550/a...