SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener · 2024 · cs.LG · arXiv 2409.11321

Mixed citation behavior. The most common role is background (55% of classified citations).

46 Pith papers cite this work.

abstract

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). The most straightforward way to improve Shampoo's computational efficiency would be to compute its eigendecomposition less frequently. Unfortunately, as our empirical results show, this degrades performance, and the degradation worsens as the interval between eigendecompositions grows. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360M- and 660M-parameter models. In the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
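The update described in the abstract can be illustrated with a minimal NumPy sketch for a single weight matrix. This is not the authors' implementation (the linked repository is the reference), only one plausible reading of "Adam in the preconditioner's eigenbasis" under stated assumptions: the function names `soap_init`, `soap_step`, and `toy_run` are hypothetical, the eigenbasis is refreshed with a full `np.linalg.eigh` call every `precond_freq` steps, and bias correction and weight decay are omitted. The default hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np


def soap_init(shape):
    """Allocate optimizer state for one weight matrix (hypothetical helper)."""
    m, n = shape
    return {
        "L": np.zeros((m, m)),  # running average of G @ G.T (left Shampoo statistic)
        "R": np.zeros((n, n)),  # running average of G.T @ G (right Shampoo statistic)
        "QL": np.eye(m),        # current left eigenbasis
        "QR": np.eye(n),        # current right eigenbasis
        "M": np.zeros((m, n)),  # first moment, kept in the original coordinates
        "V": np.zeros((m, n)),  # second moment, kept in the rotated coordinates
        "t": 0,
    }


def soap_step(W, G, state, lr=0.05, beta1=0.9, beta2=0.95, eps=1e-8, precond_freq=10):
    """One SOAP-style update; returns the new weights and mutates `state`."""
    state["t"] += 1

    # Shampoo-style statistics, updated at every step.
    state["L"] = beta2 * state["L"] + (1.0 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1.0 - beta2) * (G.T @ G)

    # Assumption: refresh the eigenbasis on the first step and then every
    # `precond_freq` steps, using a full symmetric eigendecomposition.
    if state["t"] == 1 or state["t"] % precond_freq == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]

    # Adam moments: the second moment is updated every step in the current
    # (slowly changing) basis, even between eigenbasis refreshes.
    state["M"] = beta1 * state["M"] + (1.0 - beta1) * G
    G_rot = QL.T @ G @ QR
    M_rot = QL.T @ state["M"] @ QR
    state["V"] = beta2 * state["V"] + (1.0 - beta2) * G_rot**2

    # Adam-style normalized step in the rotated space, rotated back.
    step_rot = M_rot / (np.sqrt(state["V"]) + eps)
    return W - lr * (QL @ step_rot @ QR.T)


def toy_run(steps=200, shape=(4, 6), seed=0):
    """Minimize 0.5 * ||W - target||_F^2 as a smoke test (toy problem, not from the paper)."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(shape)
    W = np.zeros(shape)
    state = soap_init(shape)

    def loss(X):
        return 0.5 * np.sum((X - target) ** 2)

    initial = loss(W)
    for _ in range(steps):
        W = soap_step(W, W - target, state)  # gradient of the toy loss is W - target
    return initial, loss(W), state
```

Calling `toy_run()` returns the initial loss, the final loss, and the optimizer state. In this sketch the only hyperparameter beyond Adam's is `precond_freq`, which matches the abstract's claim of a single additional hyperparameter.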

citation-role summary

background: 8 · method: 2 · baseline: 1

citation-polarity summary

background: 6 · unclear: 2 · use method: 2 · baseline: 1

representative citing papers

A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MEEC equips point clouds with a discrete exterior calculus that satisfies exact conservation and is differentiable in point positions, allowing a single trained kernel to produce compatible physics on unseen geometries and parameters.

Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

PACE is a clipped per-coordinate controller added to AdamW that improves the limiting error of the returned iterate average in both quadratic analysis and LM experiments.

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

SHAPE lifts gradient descent to an augmented phase space with a learned Hamiltonian vector field and event-triggered port updates to balance descent, exploitation, and exploration, improving best-so-far performance over fixed-policy methods in nonconvex tasks.

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

Hard-constrained Physics-informed Neural Networks for Interface Problems

math.NA · 2026-04-09 · conditional · novelty 7.0 · 2 refs

Windowing and buffer hard-constrained PINNs enforce interface physics by design, yielding higher interface fidelity than soft-constrained baselines on elliptic benchmarks.

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

cs.LG · 2026-03-27 · unverdicted · novelty 7.0

Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

cond-mat.dis-nn · 2026-02-04 · unverdicted · novelty 7.0

In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.

Bridging Ab Initio Symmetries and Global Nuclear Masses with Interpretable Neural Networks

nucl-th · 2026-06-26 · unverdicted · novelty 6.0

Symmetry-informed neural networks using SU(3)/SU(4) Casimir operators achieve lower RMSE on global nuclear masses than liquid-drop models, with WINN reaching 0.430 MeV validation error and showing dripline and superheavy patterns.

Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Data-similarity and data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.

Stochastic convergence of parallel asynchronous adaptive first-order methods

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Introduces a class of asynchronous adaptive first-order methods and establishes O(1/sqrt t) convergence (up to logs) for non-convex stochastic optimization under reasonable assumptions.

Global Convergence and Error Propagation in Neural Gradient Flows: A Riemannian Optimization Framework

math.OC · 2026-05-26 · unverdicted · novelty 6.0

Establishes Riemannian gradient flow equivalence for neural MMS steps, linear convergence under convexity conditions, and O(δ) tracking bounds for inexact iterates.

WINO: A Weak-Form Physics Informed Neural Operator for Hyperelasticity on Variable Domains

math.NA · 2026-05-23 · unverdicted · novelty 6.0

WINO is a weak-form physics-informed neural operator for hyperelasticity on variable domains that uses phi-FEM for geometric flexibility and achieves accuracy below 0.04 while cutting computation time by 50-80% as warm starts for solvers.

Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Block-diagonal Gauss-Newton preconditioning bounds the preconditioned NTK spectral radius by the number of networks independent of coupling strength, enabling coupling-robust accuracy in multiphysics PINNs via SOAP+GN.

Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Classical momentum acceleration in mini-batch SGD for quadratics is proportional to batch size up to saturation, enabling perfect parallelization under minimal noise assumptions.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.

Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

CLDNet is a conditional latent dynamics network surrogate for the shallow water equations that delivers 115x faster 96-hour flood forecasts on irregular metropolitan basins while maintaining usable accuracy against gauge data.

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new numerical techniques for physics-informed problems.

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

cs.LG · 2026-05-09 · conditional · novelty 6.0 · 2 refs

ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.

