SOAP: Improving and Stabilizing Shampoo using Adam
Mixed citation behavior. Most common role is background (55%).
abstract
There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regard to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens as the eigendecomposition is computed less often. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360M- and 660M-parameter models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
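The idea in the abstract, running Adam in the eigenbasis of Shampoo's preconditioner and refreshing that basis only occasionally, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (that is at the URL above); it is a simplified illustration under stated assumptions: a single 2-D weight matrix, no bias correction or weight decay, and a full `np.linalg.eigh` call at every basis refresh. The names `init_soap_state`, `soap_step`, and `precond_freq` are hypothetical and chosen only for this example.

```python
import numpy as np


def init_soap_state(shape):
    """Optimizer state for one m x n weight matrix (hypothetical layout)."""
    m, n = shape
    return {
        "L": np.zeros((m, m)),   # running average of G @ G.T (left statistics)
        "R": np.zeros((n, n)),   # running average of G.T @ G (right statistics)
        "QL": np.eye(m),         # current left eigenbasis
        "QR": np.eye(n),         # current right eigenbasis
        "M": np.zeros((m, n)),   # first moment, kept in the original basis
        "V": np.zeros((m, n)),   # second moment, kept in the rotated basis
        "t": 0,                  # step counter
    }


def soap_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.99,
              eps=1e-8, precond_freq=10):
    """One simplified SOAP-style update; returns the new weights."""
    # Shampoo-style statistics are accumulated at every step.
    state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)

    # The eigenbasis is refreshed only every `precond_freq` steps.
    # Assumption: a full eigendecomposition is used here for clarity.
    if state["t"] % precond_freq == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]

    # First moment is averaged in the original basis, then rotated.
    state["M"] = beta1 * state["M"] + (1 - beta1) * G
    G_rot = QL.T @ G @ QR
    M_rot = QL.T @ state["M"] @ QR

    # Second moment is updated at every step, but in the rotated basis.
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_rot ** 2

    # Adam-style normalized step in the rotated basis, rotated back.
    step_rot = M_rot / (np.sqrt(state["V"]) + eps)
    state["t"] += 1
    return W - lr * (QL @ step_rot @ QR.T)
```

In this sketch, `precond_freq` plays the role of the preconditioning frequency that the abstract names as the one extra hyperparameter relative to Adam. Calling `soap_step(W, grad, state)` in a training loop, with `state = init_soap_state(W.shape)` created once, applies one update per call.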
citing papers explorer
-
A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds
MEEC equips point clouds with a discrete exterior calculus that satisfies exact conservation and is differentiable in point positions, allowing a single trained kernel to produce compatible physics on unseen geometries and parameters.
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize
SHAPE lifts gradient descent to an augmented phase space with a learned Hamiltonian vector field and event-triggered port updates to balance descent, exploitation, and exploration, improving best-so-far performance over fixed-policy methods in nonconvex tasks.
-
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
-
Hard-constrained Physics-informed Neural Networks for Interface Problems
Windowing and buffer hard-constrained PINNs enforce interface physics by design, yielding higher interface fidelity than soft-constrained baselines on elliptic benchmarks.
-
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
-
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.
-
Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior
Data-similarity and data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.
-
Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization
Block-diagonal Gauss-Newton preconditioning bounds the preconditioned NTK spectral radius by the number of networks independent of coupling strength, enabling coupling-robust accuracy in multiphysics PINNs via SOAP+GN.
-
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
-
Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration
Classical momentum acceleration in mini-batch SGD for quadratics is proportional to batch size up to saturation, enabling perfect parallelization under minimal noise assumptions.
-
Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.
-
Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations
CLDNet is a conditional latent dynamics network surrogate for the shallow water equations that delivers 115x faster 96-hour flood forecasts on irregular metropolitan basins while maintaining usable accuracy against gauge data.
-
GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms
GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new numerical techniques for physics-informed problems.
-
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
Large-eddy simulation nets (LESnets) based on physics-informed neural operator for wall-bounded turbulence
LESnets integrates LES equations and the law of the wall into F-FNO to enable data-free, stable long-term predictions of wall-bounded turbulence at Re_tau up to 1000 on coarse grids, matching traditional LES accuracy at higher efficiency.
-
When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions
PINNs fail on spurious solutions admitted by the residual loss; adaptive pseudo-time stepping with Jacobian-based step selection improves accuracy and robustness on PDE benchmarks.
-
$\phi$-DeepONet: A Discontinuity Capturing Neural Operator
φ-DeepONet learns mappings with discontinuities in inputs and outputs by combining multiple branch networks with a nonlinear interface embedding in the trunk, trained via physics- and interface-informed loss, and shows accurate results on 1D/2D benchmarks.
-
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
-
Spectral Condition for $\mu$P under Width-Depth Scaling
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
-
Unsupervised simulation of incompressible flows with physics- and equality-constrained artificial neural networks
A pressure-Poisson objective combined with equality-constrained neural networks and adaptive viscosity enables unsupervised simulation of high-Reynolds-number incompressible flows including spontaneous vortex shedding.
-
Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.
-
Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
Introduces ML-FOP-SOAP optimizer using Fisher-Orthogonal Projection and hierarchical folding to mitigate modality competition in multimodal autoregressive training, reporting gains over AdamW on Janus and Emu3.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
-
GradPower: Powering Gradients for Faster Language Model Pre-Training
GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.
-
On the Convergence Analysis of Muon
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
-
Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks
Curvature-aware optimizers such as natural gradient and self-scaling BFGS/Broyden accelerate PINN convergence and accuracy on PDEs including Helmholtz, Stokes, Burgers, and Euler equations plus stiff ODEs, with new model formulations and batched scaling.