hub Canonical reference

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

· 2025 · cs.LG · arXiv 2505.16932

Canonical reference. 83% of citing Pith papers cite this work as background.

19 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 19 citing papers arXiv PDF

abstract

Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon optimizer for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce Polar Express, a new method for computing the polar decomposition. Like Newton-Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense, allowing Polar Express to converge as rapidly as possible both in the early iterations and asymptotically. We also address finite-precision issues, making it practical to use in bfloat16. When integrated into Muon, our method yields consistent improvements in validation loss for a GPT-2 model trained on one to ten billion tokens from the FineWeb dataset, outperforming recent alternatives across a range of learning rates.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 5 unclear 1

representative citing papers

Recursive expansion of the matrix step function using polynomials of degree eight

math.NA · 2026-06-23 · unverdicted · novelty 7.0

Recursive polynomial expansion for the matrix step function uses degree-eight components evaluated in three matrix multiplications to reduce overall multiplication count versus prior recursive methods.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

Frank-Wolfe with Moreau Envelope Smoothing for Nonsmooth Nonconvex Problems

math.OC · 2026-05-30 · unverdicted · novelty 6.0

FRAMES uses Moreau envelope smoothing with Frank-Wolfe steps for nonsmooth nonconvex problems, proving convergence rates under mild assumptions and highlighting a new gap relationship.

Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Introduces Distance-Adaptive Muon, Scale-Calibrated Muon, and Distance-Free Muon with stationarity and O(1/T) objective-gap guarantees, shown to match or improve fixed-scale Muon on GPT-124M and ViT-Tiny models.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

cs.LG · 2025-10-12 · unverdicted · novelty 6.0

Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.

Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.

Convergence of Spectral Descent for Non-smooth Optimization

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust low-rank matrix recovery.

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling

cs.LG · 2026-06-01 · unverdicted · novelty 4.0

Derives finite-round upper-tail guarantee on population-empirical gap for client-sampled orthogonalized matrix momentum under heterogeneous data, with Lipschitz condition on the orthogonalizer.

MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

cs.LG · 2026-02-25 · unverdicted · novelty 4.0

Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.

Multi-Gate Residuals

cs.LG · 2026-05-22 · unverdicted · novelty 3.0

Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

citing papers explorer

Showing 19 of 19 citing papers.

Recursive expansion of the matrix step function using polynomials of degree eight math.NA · 2026-06-23 · unverdicted · none · ref 3 · internal anchor
Recursive polynomial expansion for the matrix step function uses degree-eight components evaluated in three matrix multiplications to reduce overall multiplication count versus prior recursive methods.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 2 · internal anchor
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters cs.LG · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds cs.LG · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
Frank-Wolfe with Moreau Envelope Smoothing for Nonsmooth Nonconvex Problems math.OC · 2026-05-30 · unverdicted · none · ref 39 · internal anchor
FRAMES uses Moreau envelope smoothing with Frank-Wolfe steps for nonsmooth nonconvex problems, proving convergence rates under mild assumptions and highlighting a new gap relationship.
Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization cs.LG · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Introduces Distance-Adaptive Muon, Scale-Calibrated Muon, and Distance-Free Muon with stationarity and O(1/T) objective-gap guarantees, shown to match or improve fixed-scale Muon on GPT-124M and ViT-Tiny models.
Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 167 · internal anchor
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation cs.LG · 2026-05-08 · unverdicted · none · ref 31 · internal anchor
PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 3 · internal anchor
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 54 · internal anchor
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods cs.LG · 2025-10-12 · unverdicted · none · ref 2 · internal anchor
Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.
Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling cs.LG · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.
Convergence of Spectral Descent for Non-smooth Optimization cs.LG · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust low-rank matrix recovery.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 45 · internal anchor
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives math.OC · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.
A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling cs.LG · 2026-06-01 · unverdicted · none · ref 12 · internal anchor
Derives finite-round upper-tail guarantee on population-empirical gap for client-sampled orthogonalized matrix momentum under heterogeneous data, with Lipschitz condition on the orthogonalizer.
MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training cs.LG · 2026-02-25 · unverdicted · none · ref 2 · internal anchor
Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.
Multi-Gate Residuals cs.LG · 2026-05-22 · unverdicted · none · ref 7 · internal anchor
Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer