arxiv: 2605.18528 · v1 · pith:FPEVKMNQnew · submitted 2026-05-18 · 🧮 math.OC · cs.LG

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

Pith reviewed 2026-05-20 08:45 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords scale-invariant optimization · heavy-tailed noise · nonconvex stochastic optimization · spectral norm · oracle complexity · Scion method · neural network training · Hessian Lipschitz

The pith

Any scale-invariant first-order method using the spectral norm requires Ω(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls to reach an ε-stationary point under p-th-moment heavy-tailed noise when max{m,n}/(min{m,n})^2 is large enough.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic optimization of matrix-valued problems that arise in scale-invariant neural network layers, where the objective is to reach an approximate stationary point despite noise that obeys only a p-th moment bound rather than sub-Gaussian tails. It establishes a dimension-dependent lower bound: any scale-invariant first-order algorithm using the spectral norm must pay a cost that is linear in the smaller matrix dimension and polynomial in 1/ε. The authors then show that a batched Scion method matches this lower bound in its dependence on min{m,n} and ε, and they propose a transported Scion variant that improves the exponent when the Hessian is Lipschitz continuous.

Core claim

In nonconvex smooth stochastic optimization over R^{m×n} equipped with general norms, when max{m,n}/(min{m,n})^2 is large enough, every scale-invariant first-order method that uses the spectral norm must perform Ω(min{m,n} ε^{-(3p-2)/(p-1)}) calls to a stochastic oracle to produce an ε-stationary point under p-th-moment heavy-tailed noise. A batched Scion method attains the matching O(min{m,n} ε^{-(3p-2)/(p-1)}) upper bound; under the additional assumption that the Hessian is Lipschitz, a transported Scion method further reduces the complexity to O(min{m,n} ε^{-(5p-3)/(2p-2)}).
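As a quick check on how the two upper-bound exponents relate, the following worked calculation uses only the rates stated above; it is not an equation quoted from the paper.

```latex
\[
\frac{3p-2}{p-1} - \frac{5p-3}{2p-2}
  = \frac{(6p-4) - (5p-3)}{2(p-1)}
  = \frac{p-1}{2(p-1)}
  = \frac{1}{2}, \qquad p > 1 .
\]
% Example: at p = 2 (finite variance) the exponents are 4 and 7/2.
\[
p = 2:\qquad \frac{3p-2}{p-1} = 4, \qquad \frac{5p-3}{2p-2} = \frac{7}{2} .
\]
```

Under the Hessian-Lipschitz assumption the transported method therefore saves a factor of ε^{-1/2} for every p > 1, while both exponents grow without bound as p approaches 1 from above.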

What carries the argument

The Scion method, a normalized update rule that respects input-output matrix norm geometry while using batching or transport to control variance from heavy-tailed gradients.
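A minimal sketch of what a spectral-norm normalized update can look like, assuming the step is built from a linear minimization oracle over the spectral-norm ball. The function names `spectral_lmo` and `normalized_step`, the plain mini-batch average, and the additive step form are illustrative assumptions; the paper's batched and transported Scion updates are not reproduced here.

```python
import numpy as np

def spectral_lmo(grad, radius=1.0):
    """Minimize <grad, D> over the ball {D : ||D||_2 <= radius}.

    For the spectral norm the minimizer is -radius * U @ Vt, where
    grad = U @ diag(s) @ Vt is a thin SVD (hypothetical helper).
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * (u @ vt)

def normalized_step(weights, grad_batch, lr, radius=1.0):
    """One illustrative step: average a mini-batch, then move along the LMO."""
    g = np.mean(grad_batch, axis=0)  # batching to reduce gradient noise
    return weights + lr * spectral_lmo(g, radius)

# Small demonstration on random data (not from the paper).
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
D = spectral_lmo(G)
W_next = normalized_step(np.zeros((4, 3)), np.stack([G, G]), lr=0.1)
```

The direction depends only on the singular vectors of the averaged gradient, so rescaling the gradient by any positive constant leaves the step unchanged. This is the sense in which such an update is insensitive to gradient scale.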

If this is right

  • The lower and upper bounds are tight, so the exponent (3p-2)/(p-1) is optimal for first-order scale-invariant methods under heavy tails.
  • Higher-order smoothness via Hessian Lipschitzness yields a strictly better exponent through the transported Scion construction.
  • The lower bound applies whenever max{m,n}/(min{m,n})^2 is large enough, that is, for sufficiently unbalanced matrix dimensions.
  • Practical heuristics can be layered on the transported Scion method while preserving its theoretical rate.
  • In that regime the dimension factor min{m,n} is unavoidable for scale-invariant spectral-norm methods: the oracle cost scales linearly with the smaller matrix side.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Optimizers that ignore the tail index p will pay a worse rate than necessary when real gradients exhibit heavy tails.
  • The transported variant may be worth testing on models whose weight matrices have extreme aspect ratios, such as wide embedding layers.
  • Whether these complexity improvements translate into faster wall-clock training or better generalization remains to be checked empirically.
  • Similar norm-geometry arguments could be applied to other structured parameter spaces common in modern architectures.

Load-bearing premise

The stochastic gradient noise satisfies a p-th moment bound for some p greater than 1.
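One common way to write such a premise, and a reason batching helps under it, is sketched below. The symbols g, F, σ, ξ, Z_i, B and the unspecified norm are assumptions for illustration, and the paper's exact statement may differ. The second display is the classical von Bahr-Esseen inequality for scalar noise, which requires 1 < p ≤ 2.

```latex
% Assumed form of the p-th moment noise bound (notation hypothetical):
\[
\mathbb{E}_{\xi}\bigl[\, \| g(X;\xi) - \nabla F(X) \|^{p} \,\bigr] \le \sigma^{p},
\qquad p > 1 .
\]
% Scalar illustration for 1 < p <= 2: for independent zero-mean
% Z_1, ..., Z_B with E|Z_i|^p <= sigma^p, von Bahr--Esseen gives
\[
\mathbb{E}\Bigl| \frac{1}{B} \sum_{i=1}^{B} Z_i \Bigr|^{p}
  \le \frac{2}{B^{p}} \sum_{i=1}^{B} \mathbb{E}|Z_i|^{p}
  \le 2\,\sigma^{p} B^{\,1-p} .
\]
```

In this scalar setting the averaged noise shrinks like B^{-(p-1)/p} in L^p norm, which recovers the familiar B^{-1/2} at p = 2 and degrades as p approaches 1.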

What would settle it

An explicit scale-invariant first-order algorithm with spectral norm that reaches an ε-stationary point in o(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls for sufficiently unbalanced dimensions under the same p-moment noise model would falsify the lower bound.

Original abstract

A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only support hyperparameter transfer across model sizes but exploit input-output matrix norm geometry. At the same time, stochastic gradient noises in deep learning are often far from sub-Gaussian and may exhibit heavy tails. These crucial observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to achieve an $\epsilon$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with spectral norm requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with spectral norm achieves the matching upper bound of $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies nonconvex stochastic optimization over matrices in R^{m×n} equipped with general norms, focusing on scale-invariant first-order methods under p-th moment heavy-tailed noise. It derives a dimension-dependent lower bound of Ω(min{m,n} ε^{-(3p-2)/(p-1)}) on the oracle complexity of any scale-invariant method restricted to the spectral norm when max{m,n}/(min{m,n})^2 is sufficiently large. A batched Scion method is shown to achieve the matching O(min{m,n} ε^{-(3p-2)/(p-1)}) upper bound, while a transported Scion variant improves the rate to O(min{m,n} ε^{-(5p-3)/(2p-2)}) under the additional assumption of Hessian Lipschitzness. The work concludes with practical heuristics and experiments demonstrating applicability to neural network training across architectures and scales.

Significance. If the matching lower and upper bounds hold under the stated assumptions, the results clarify unavoidable dimension dependence and complexity for scale-invariant methods in the presence of heavy-tailed noise, which is a realistic model for deep learning gradients. The improvement via the transported method under higher-order smoothness, combined with empirical validation, offers concrete guidance for optimizer design that respects parametrization and norm geometry. The explicit p-moment noise model and dimension condition make the claims falsifiable and relevant to the field.

major comments (2)
  1. [Abstract and lower-bound section] The lower bound in the abstract (and presumably §4) is stated for spectral norm, yet the problem setting is introduced with general input-output matrix norms; the manuscript should clarify whether the Ω(min{m,n} ε^{-(3p-2)/(p-1)}) rate extends to other norms or if spectral norm is necessary for the hardness construction, as this affects the generality of the central claim.
  2. [Transported Scion analysis and experiments] The transported Scion improvement to O(min{m,n} ε^{-(5p-3)/(2p-2)}) relies on Hessian Lipschitzness (abstract); the paper must specify how this assumption is verified or relaxed in the neural-network experiments, since violation could invalidate the faster rate and undermine the practical significance of the higher-order variant.
minor comments (2)
  1. [Abstract and setting] Notation for the matrix dimensions m,n and the ratio max{m,n}/(min{m,n})^2 should be introduced with a precise threshold value for 'large enough' to make the lower-bound statement self-contained.
  2. [Preliminaries] The definition of scale-invariance for the methods (used in both lower and upper bounds) would benefit from an explicit equation or property list early in the manuscript to avoid ambiguity when comparing to prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback. The comments highlight important points regarding the scope of our theoretical results and the connection to experiments. We address each major comment below and have incorporated revisions to improve clarity.

Point-by-point responses
  1. Referee: [Abstract and lower-bound section] The lower bound in the abstract (and presumably §4) is stated for spectral norm, yet the problem setting is introduced with general input-output matrix norms; the manuscript should clarify whether the Ω(min{m,n} ε^{-(3p-2)/(p-1)}) rate extends to other norms or if spectral norm is necessary for the hardness construction, as this affects the generality of the central claim.

    Authors: We agree that additional clarification is warranted. The lower bound construction in Section 4 relies on specific properties of the spectral norm (in particular, its behavior under scale-invariant updates and the choice of hard instances that exploit the operator norm geometry). The result does not directly extend to arbitrary input-output norms, for which the dimension dependence may be milder or require a different hardness argument. Our matching upper bound for the batched Scion method holds for general norms, while the lower bound is stated specifically for the spectral norm. We will revise the abstract and add a short paragraph at the end of Section 4 to make this distinction explicit, thereby strengthening the precision of the central claim without altering its substance. revision: yes

  2. Referee: [Transported Scion analysis and experiments] The transported Scion improvement to O(min{m,n} ε^{-(5p-3)/(2p-2)}) relies on Hessian Lipschitzness (abstract); the paper must specify how this assumption is verified or relaxed in the neural-network experiments, since violation could invalidate the faster rate and undermine the practical significance of the higher-order variant.

    Authors: The faster rate for the transported Scion method is derived under the additional assumption of Hessian Lipschitz continuity, which is stated clearly in the abstract and analysis. In the neural-network experiments we apply practical heuristics inspired by the transported update (e.g., approximate transport maps and adaptive batching) rather than enforcing the Hessian-Lipschitz condition, which is generally unverifiable at scale. We will expand the experimental section to explicitly note that the O(min{m,n} ε^{-(5p-3)/(2p-2)}) guarantee is theoretical, while the heuristics are motivated by the analysis and are evaluated empirically for their practical benefits even when the higher-order assumption may hold only approximately. This revision clarifies the theory-practice gap without changing the reported results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper derives dimension-dependent lower and upper bounds on oracle complexity for scale-invariant first-order methods under p-th moment heavy-tailed noise directly from the problem setting (spectral norm, matrix dimensions m and n, and the explicit noise moment assumption). The matching and improved upper bounds for the batched and transported Scion methods follow from standard nonconvex stochastic optimization analysis without reducing to fitted parameters, self-definitional constructions, or load-bearing self-citations. The Hessian Lipschitz condition for the transported variant is an additional independent assumption that does not loop back to the core claims. The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The results rest on standard domain assumptions about noise moments and smoothness; new algorithmic entities are introduced without independent evidence beyond the theoretical analysis.

axioms (2)
  • domain assumption: Stochastic gradients have finite p-th moment for p > 1.
    Invoked to model heavy-tailed noise and derive the specific complexity exponents.
  • domain assumption: Objective is nonconvex and sufficiently smooth (Lipschitz gradient or Hessian).
    Standard assumption for nonconvex stochastic optimization analysis in the abstract setting.
invented entities (2)
  • Batched Scion method (no independent evidence)
    purpose: Achieves the matching upper bound for scale-invariant first-order optimization with spectral norm.
    New algorithm proposed to match the lower bound.
  • Transported Scion method (no independent evidence)
    purpose: Exploits higher-order smoothness to improve the convergence rate under Hessian Lipschitzness.
    Improved variant for the case with Lipschitz Hessian.



