pith. machine review for the scientific record.

arxiv: 2605.08578 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: transformer world models · scaling regimes · Atari · data-efficient learning · generalist models · offline reinforcement learning

The pith

Joint training on 26 Atari environments stabilizes scaling in a transformer world model, producing monotonic fidelity gains and downstream policies with a 0.770 median normalized score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a minimalist transformer as a world model on fixed offline datasets from Atari 100k, keeping data budgets and capacities identical across tasks. Individual environments split into two scaling regimes: some improve steadily as the model grows larger, while others lose accuracy once past a certain size. Training one shared transformer across all 26 environments removes this split and produces reliable gains in every game. The resulting higher-fidelity models then support policies trained entirely inside the simulation that reach strong performance when transferred to the real environments. The central message is that the choice of training regime and scaling approach can matter as much as new architectural ideas for data-efficient generalist systems.

Core claim

Environments fall into distinct scaling regimes even under identical offline data and model capacity: some allow models to pass the interpolation threshold and show monotonic improvements in the overparameterized regime, while others remain in the classical regime where larger models reduce fidelity. In the unified setting, a single transformer trained jointly on the full suite of 26 environments stabilizes these dynamics and delivers monotonic gains across every environment regardless of its individual regime. Improved world-model fidelity translates directly to control, with policies learned entirely inside the simulated dynamics attaining a median expert-random-normalized score of 0.770.
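
For concreteness, the expert-random normalization presumably maps each game's random-policy return to 0 and the data-collecting expert's return to 1, with the headline 0.770 being the median of those per-game values. A minimal sketch, using hypothetical raw and reference returns rather than the paper's numbers:

```python
import statistics

def expert_random_normalized(agent: float, random_score: float, expert: float) -> float:
    """Map the random policy's return to 0 and the expert's return to 1."""
    return (agent - random_score) / (expert - random_score)

# Hypothetical raw returns for three of the 26 games; the reported 0.770 is
# the median of such normalized scores across all 26 environments.
scores = [
    expert_random_normalized(agent=25.0, random_score=1.7, expert=32.0),     # Breakout-like
    expert_random_normalized(agent=11.0, random_score=-20.7, expert=14.6),   # Pong-like
    expert_random_normalized(agent=410.0, random_score=68.0, expert=520.0),  # Seaquest-like
]
print(statistics.median(scores))
```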

What carries the argument

The minimalist transformer world model under joint versus per-environment training, which exposes and then removes environment-specific scaling regimes.
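
To make that contrast concrete, here is a minimal sketch (an assumption, not the authors' code) of the two regimes: 26 transformers each trained on its own environment's fixed offline dataset, versus one shared transformer trained on the pooled data. Tokenization, dataset contents, and hyperparameters are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

VOCAB, SEQ_LEN, N_ENVS = 512, 32, 26

def make_env_dataset(n_seqs: int = 64) -> TensorDataset:
    """Stand-in for one environment's fixed offline dataset of tokenized frames."""
    return TensorDataset(torch.randint(0, VOCAB, (n_seqs, SEQ_LEN)))

class TinyWorldModel(nn.Module):
    """Minimal next-token transformer over discretized observation tokens."""
    def __init__(self, depth: int = 2, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.blocks(self.embed(tokens), mask=causal))

def train_one_epoch(model, loader):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for (batch,) in loader:
        logits = model(batch[:, :-1])  # predict the next token from each prefix
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

env_datasets = [make_env_dataset() for _ in range(N_ENVS)]

# Per-environment regime: one world model per game, trained only on its own data.
for ds in env_datasets:
    train_one_epoch(TinyWorldModel(), DataLoader(ds, batch_size=16, shuffle=True))

# Unified regime: a single shared world model trained on the pooled 26-game data.
train_one_epoch(TinyWorldModel(),
                DataLoader(ConcatDataset(env_datasets), batch_size=16, shuffle=True))
```

Note that the unified run sees roughly 26 times more transitions per pass, which is exactly the data-volume confound the referee report below raises.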

If this is right

  • Individual environments exhibit fundamentally different scaling behaviors under the same data budget and model family.
  • Joint training across environments removes per-task scaling differences and guarantees consistent improvement with added scale.
  • Higher world-model accuracy obtained through joint training produces stronger control policies when those policies are trained entirely in simulation.
  • Progress toward data-efficient generalist systems requires attention to scaling and training regime choices in addition to architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The stabilization effect may extend to other multi-task settings where shared training data can override single-task scaling ceilings.
  • Future experiments could test whether the same joint-training benefit appears when the world model is updated online rather than trained on fixed expert traces.
  • The result suggests that generalist world models may reduce the need for environment-specific hyperparameter tuning around model size.

Load-bearing premise

The fixed offline datasets collected from a single expert policy supply an unbiased and complete picture of each environment's dynamics without interference from online interaction or policy-dependent sampling.

What would settle it

Repeating the scaling curves after replacing the expert-derived datasets with data collected from a different or weaker policy and checking whether the joint-training stabilization and monotonic gains survive.

Figures

Figures reproduced from arXiv: 2605.08578 by Jooyeon Kim.

Figure 1. Overview of the minimalistic world model.
Figure 2. (Top left) Schematic of the deep double descent phenomenon with the generalization risk peaking at the interpolation threshold, which dichotomizes the classical and modern overparameterized regimes. (Top right) Divergent generalization regimes. Even with the identical sample budget (N = 10^5) and model configurations (L = 2 … 96), increasing model size yields different trends depending on task variation …
Figure 3. (Left) Average loss curves across 26 separately-trained transformer world models, each with its corresponding Atari environment. The global trend shows monotonic improvement with model depth, driven by the prevalence of monotonic scaling regime tasks (12/26) …
Figure 4. Validation loss comparison between the unified model …
Figure 5. Learning dynamics of the unified world model.
Figure 6. Downstream policy performance in learned environments.
Figure 7. Comparisons of the training results of the original PPO paper (O; left; gray) and the newly trained …
Figure 8. Learning curves of the presupposed PPO algorithms on 26 Atari 100K benchmark environments.
Figure 9. Collection of the presupposed trajectories with 100K environmental steps. We gradually decrease …
Figure 10. MSE loss on validation sets for the VAE models. The size of the input image is …
Figure 11. VAE training results with the original observations at the bottom and the reconstructed (encoded …
Figure 12. Empirical loss curves of the transformer world models, each of which was separately trained on 26 …
Figure 13. Environment-wise results for the unified world model. In stark contrast to the individual setting, …
Figure 14. Learning curves for the policy learning in world models.
Figure 15. Scaling behavior under a strictly fixed total data budget of 100k frames. The x-axis denotes …
read the original abstract

Developing generalist systems that retain human-like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model scale. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from a presupposed expert policy. Our results reveal that environments fundamentally fall into distinct scaling regimes, even when constrained by identical offline data budgets and model capacities. For individual tasks, some environments naturally allow models to pass the interpolation threshold, yielding monotonic improvements in the overparameterized regime, while others remain trapped in the classical regime, where larger world models degrade fidelity. In the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover that joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert-random-normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that individual Atari environments exhibit distinct scaling regimes for a minimalist transformer world model trained on fixed offline expert datasets under identical data budgets, with some environments showing monotonic fidelity gains in the overparameterized regime and others degrading in the classical regime. It further claims that joint training of a single transformer on the union of 26 Atari environments stabilizes scaling dynamics to produce monotonic gains across all environments regardless of their individual regimes, and that the resulting higher-fidelity world models enable downstream policies achieving a median expert-random-normalized score of 0.770.

Significance. If the stabilization from joint training holds after addressing data-volume confounds and includes statistical validation, the results would be significant for scaling laws in generalist world models and data-efficient RL, indicating that multi-task training can mitigate regime-specific overfitting issues and support progress toward generalist agents.

major comments (2)
  1. [Abstract and unified setting experiments] Abstract (unified setting claim): the assertion that joint training 'stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes' is load-bearing but threatened by the absence of any control for total training tokens, epochs, or gradient steps; the multi-task model is trained on the union of all 26 datasets and therefore sees substantially more environment transitions than any single-task run, which could produce the observed monotonicity through reduced overfitting on a larger corpus rather than cross-environment interaction.
  2. [Experimental setup and results] Experimental details (regime classification and evaluation): the abstract reports distinct scaling regimes and downstream control results but provides no information on statistical significance, error bars or run-to-run variance, exact model architectures, hyperparameter selection, or the criteria used to classify environments into scaling regimes, which undermines assessment of whether the post-hoc regime assignments and fidelity-to-control translation are robust.
minor comments (2)
  1. [Abstract] The median normalized score of 0.770 is reported without direct comparison to single-task world-model baselines or prior methods, limiting interpretation of the downstream control improvement.
  2. [Introduction and abstract] Clarify notation for 'interpolation threshold' and 'classical regime' with explicit references to prior scaling literature to aid readers unfamiliar with the terminology.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting potential data-volume confounds and the need for greater experimental transparency. We address each major comment below and commit to revisions that strengthen the claims without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract and unified setting experiments] Abstract (unified setting claim): the assertion that joint training 'stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes' is load-bearing but threatened by the absence of any control for total training tokens, epochs, or gradient steps; the multi-task model is trained on the union of all 26 datasets and therefore sees substantially more environment transitions than any single-task run, which could produce the observed monotonicity through reduced overfitting on a larger corpus rather than cross-environment interaction.

    Authors: We agree this is a valid concern and a potential confound. The unified model inherently processes more total transitions due to the concatenated datasets. To isolate the contribution of cross-environment interactions, we will add a controlled ablation in the revised manuscript: single-task models will be trained with matched total tokens/gradient steps by repeating their fixed offline datasets (with appropriate shuffling) or extending epochs proportionally. We will report scaling curves under these matched budgets and test whether monotonic gains still emerge only under joint training. If the stabilization persists, this will support the cross-environment benefit; otherwise, we will qualify the claims accordingly. We will also update the abstract and discussion to explicitly note the data-volume difference and the new controls. revision: yes

  2. Referee: [Experimental setup and results] Experimental details (regime classification and evaluation): the abstract reports distinct scaling regimes and downstream control results but provides no information on statistical significance, error bars or run-to-run variance, exact model architectures, hyperparameter selection, or the criteria used to classify environments into scaling regimes, which undermines assessment of whether the post-hoc regime assignments and fidelity-to-control translation are robust.

    Authors: We acknowledge these omissions weaken the current presentation. In the revision we will: (1) report all results with error bars from at least three independent random seeds per configuration, including run-to-run variance; (2) add statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values) for key comparisons between scaling regimes and between single- vs. multi-task fidelity; (3) specify the exact transformer architecture (layers, heads, embedding dimension, context length) and all hyperparameters; (4) detail the hyperparameter selection procedure (grid or random search ranges and validation metric); and (5) provide explicit, reproducible criteria for regime classification (e.g., sign of the slope of validation loss vs. model size in the overparameterized regime, with a quantitative threshold). These additions will be placed in a new 'Experimental Details' subsection and the appendix, allowing readers to evaluate robustness directly. revision: yes
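
As an illustration of the reproducible criterion promised in point (5), a regime label could be assigned from the fitted slope of validation loss against log model depth plus a rise threshold; the function, threshold, and loss values below are hypothetical, not the paper's.

```python
import numpy as np

def classify_regime(depths, val_losses, rel_threshold=0.05):
    """Label 'classical' if validation loss rises above its minimum by more than
    rel_threshold as depth grows, 'monotonic' if it keeps falling overall."""
    slope = np.polyfit(np.log(depths), val_losses, deg=1)[0]
    rise = (val_losses[-1] - val_losses.min()) / val_losses.min()
    if rise > rel_threshold:
        return "classical"
    return "monotonic" if slope < 0 else "flat"

depths = np.array([2, 4, 8, 16, 32, 64, 96])
print(classify_regime(depths, np.array([1.90, 1.60, 1.42, 1.31, 1.26, 1.23, 1.21])))  # monotonic
print(classify_regime(depths, np.array([1.90, 1.62, 1.50, 1.57, 1.72, 1.95, 2.10])))  # classical
```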

Circularity Check

0 steps flagged

No circularity: empirical observations from controlled scaling experiments

full rationale

The paper reports direct empirical measurements of scaling behaviors for a minimalist transformer world model trained on fixed offline Atari 100k datasets derived from expert policies. Claims of distinct per-environment scaling regimes (monotonic vs. classical) and stabilization under joint multi-task training on the union of 26 datasets are presented as observed outcomes of training runs and downstream policy learning, with no mathematical derivation chain, equations, or fitted parameters that reduce predictions to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the experimental design (identical per-task data budgets for single-task baselines) does not create self-definitional loops. The analysis remains self-contained as falsifiable empirical evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical scaling study with no new theoretical derivations; it relies on standard transformer sequence modeling assumptions and the Atari environment dynamics as given.

axioms (1)
  • domain assumption: A transformer architecture can serve as a generalist world model for visual Atari observations.
    Used as the minimalist base model throughout the scaling experiments.

pith-pipeline@v0.9.0 · 5514 in / 1279 out tokens · 47334 ms · 2026-05-12T00:59:10.951231+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 7 internal anchors
