Pith: machine review for the scientific record

arXiv: 2510.09378 · v2 · submitted 2025-10-10 · cs.LG · cs.AI

Recognition: unknown

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

keywords: full · layerwise · performance · approximations · gains · gauss-newton · information · potential

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
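
As a rough illustration of the two preconditioners the abstract compares, the sketch below builds a full Gauss-Newton matrix and its layerwise (block-diagonal) restriction for a toy two-layer network. It is not the paper's implementation: the squared-error loss, the finite-difference Jacobian, the damping value, the network sizes, and every function name are assumptions chosen only to keep the example small.

```python
import numpy as np

def unpack(theta, d_in, d_hid, d_out):
    """Split the flat parameter vector into the two weight matrices."""
    n1 = d_hid * d_in
    W1 = theta[:n1].reshape(d_hid, d_in)
    W2 = theta[n1:].reshape(d_out, d_hid)
    return W1, W2

def residuals(theta, X, Y, dims):
    """Residual vector r(theta) = f(X; theta) - Y, flattened."""
    W1, W2 = unpack(theta, *dims)
    pred = np.tanh(X @ W1.T) @ W2.T
    return (pred - Y).ravel()

def jacobian(theta, X, Y, dims, eps=1e-6):
    """Central-difference Jacobian of the residuals (assumption: toy scale only)."""
    J = np.empty((X.shape[0] * dims[2], theta.size))
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        r_plus = residuals(theta + step, X, Y, dims)
        r_minus = residuals(theta - step, X, Y, dims)
        J[:, i] = (r_plus - r_minus) / (2 * eps)
    return J

def gn_directions(theta, X, Y, dims, damping=1e-3):
    """Return the gradient plus the full-GN and layerwise-GN update directions."""
    r = residuals(theta, X, Y, dims)
    J = jacobian(theta, X, Y, dims)
    g = J.T @ r                      # gradient of 0.5 * ||r||^2
    G = J.T @ J                      # Gauss-Newton matrix for squared loss
    I = np.eye(theta.size)

    # Full GN: keeps every cross-layer block of G.
    d_full = -np.linalg.solve(G + damping * I, g)

    # Layerwise GN: keep only the per-layer diagonal blocks of G.
    n1 = dims[1] * dims[0]
    G_layer = np.zeros_like(G)
    G_layer[:n1, :n1] = G[:n1, :n1]
    G_layer[n1:, n1:] = G[n1:, n1:]
    d_layer = -np.linalg.solve(G_layer + damping * I, g)
    return g, d_full, d_layer

rng = np.random.default_rng(0)
dims = (3, 4, 2)                     # (d_in, d_hid, d_out), hypothetical sizes
X = rng.normal(size=(16, dims[0]))
Y = rng.normal(size=(16, dims[2]))
theta = 0.5 * rng.normal(size=dims[1] * dims[0] + dims[2] * dims[1])

g, d_full, d_layer = gn_directions(theta, X, Y, dims)
```

Because the damped matrices are positive definite, both `d_full` and `d_layer` are descent directions here. The difference between them reflects only the cross-layer blocks that the layerwise variant discards, which is the quantity the abstract reports as mattering little in the authors' experiments.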

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Error whitening: Why Gauss-Newton outperforms Newton

    cs.LG · 2026-05 · conditional novelty 6.0

    Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.

  2. RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

    cs.LG · 2026-03 · conditional novelty 5.0

    RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
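
The row-wise L2 normalization mentioned in the second entry can be sketched as follows. This is only a guess at the general idea based on the one-line summary above, not the RMNP algorithm as published: the momentum rule, the `eps` guard, the learning rate, and the function names are all hypothetical.

```python
import numpy as np

def row_normalize(M, eps=1e-8):
    """Scale each row of M to (approximately) unit L2 norm; costs O(mn)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / (norms + eps)

def momentum_step(W, grad, buf, lr=0.1, beta=0.9):
    """One hypothetical update: accumulate momentum, then row-normalize it."""
    buf = beta * buf + grad
    return W - lr * row_normalize(buf), buf

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
grad = rng.normal(size=(4, 6))
buf = np.zeros_like(W)

W_new, buf = momentum_step(W, grad, buf)
row_norms = np.linalg.norm(row_normalize(buf), axis=1)
```

Each row is touched once, so the normalization is linear in the number of matrix entries, in contrast to an iterative matrix orthogonalization such as Newton-Schulz.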