SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade · 2024 · arXiv 2409.11321

19 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MEEC equips point clouds with a discrete exterior calculus that satisfies exact conservation and is differentiable in point positions, allowing a single trained kernel to produce compatible physics on unseen geometries and parameters.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.

Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

SHAPE lifts gradient descent to an augmented phase space with a learned Hamiltonian vector field and event-triggered port updates to balance descent, exploitation, and exploration, improving best-so-far performance over fixed-policy methods in nonconvex tasks.

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

Hard-constrained Physics-informed Neural Networks for Interface Problems

math.NA · 2026-04-09 · conditional · novelty 7.0

Hard-constrained PINN formulations via windowing and buffer approaches enforce interface conditions by design and outperform soft-constrained baselines on 1D and 2D elliptic interface problems.

Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

CLDNet is a conditional latent dynamics network surrogate for the shallow water equations that delivers 115x faster 96-hour flood forecasts on irregular metropolitan basins while maintaining usable accuracy against gauge data.

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new numerical techniques for physics-informed problems.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.

Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

Large-eddy simulation nets (LESnets) based on physics-informed neural operator for wall-bounded turbulence

physics.flu-dyn · 2026-04-29 · unverdicted · novelty 6.0

LESnets integrates LES equations and the law of the wall into F-FNO to enable data-free, stable long-term predictions of wall-bounded turbulence at Re_tau up to 1000 on coarse grids, matching traditional LES accuracy at higher efficiency.

When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions

cs.LG · 2026-04-26 · conditional · novelty 6.0

PINNs fail on spurious solutions admitted by the residual loss; adaptive pseudo-time stepping with Jacobian-based step selection improves accuracy and robustness on PDE benchmarks.

$\phi$-DeepONet: A Discontinuity Capturing Neural Operator

cs.CE · 2026-04-09 · unverdicted · novelty 6.0

φ-DeepONet learns mappings with discontinuities in inputs and outputs by combining multiple branch networks with a nonlinear interface embedding in the trunk, trained via physics- and interface-informed loss, and shows accurate results on 1D/2D benchmarks.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

cs.LG · 2026-04-06 · unverdicted · novelty 4.0

Curvature-aware optimizers such as natural gradient and self-scaling BFGS/Broyden accelerate PINN convergence and accuracy on PDEs including Helmholtz, Stokes, Burgers, and Euler equations plus stiff ODEs, with new model formulations and batched scaling.

citing papers explorer

Showing 19 of 19 citing papers.

A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds

cs.LG · 2026-05-08 · unverdicted · none · ref 67

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · none · ref 65

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

cs.LG · 2026-05-09 · unverdicted · none · ref 17

Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

cs.LG · 2026-05-08 · unverdicted · none · ref 16

When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize

cs.LG · 2026-05-07 · unverdicted · none · ref 47

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · none · ref 48

Hard-constrained Physics-informed Neural Networks for Interface Problems

math.NA · 2026-04-09 · conditional · none · ref 25

Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

cs.LG · 2026-05-13 · unverdicted · none · ref 38

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

cs.LG · 2026-05-11 · unverdicted · none · ref 51

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · none · ref 19

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · none · ref 30

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

cs.LG · 2026-05-07 · unverdicted · none · ref 17

Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

cs.LG · 2026-05-07 · unverdicted · none · ref 36

Large-eddy simulation nets (LESnets) based on physics-informed neural operator for wall-bounded turbulence

physics.flu-dyn · 2026-04-29 · unverdicted · none · ref 100

When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions

cs.LG · 2026-04-26 · conditional · none · ref 63

$\phi$-DeepONet: A Discontinuity Capturing Neural Operator

cs.CE · 2026-04-09 · unverdicted · none · ref 41

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · none · ref 19

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

cs.LG · 2026-05-12 · unverdicted · none · ref 78

Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

cs.LG · 2026-04-06 · unverdicted · none · ref 73
