

Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005

Canonical reference. 80% of citing Pith papers cite this work as background.

29 Pith papers citing it
Background 80% of classified citations


citation-role summary

background: 4 · method: 1

citation-polarity summary

years

2026: 27 · 2025: 2


representative citing papers

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer that integrates Muon orthogonalization with Schedule-Free averaging via adaptive interpolation, enabling schedule-free anytime training and improving Pareto frontiers on vision and LLM tasks.

Accelerating LMO-Based Optimization via Implicit Gradient Transport

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.

On the Convergence of Muon and Beyond

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of Õ(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

Aurora: A Leverage-Aware Spectral Optimizer

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Aurora is a leverage-aware spectral optimizer that enforces uniform row norms in matrix updates while preserving Muon's polar geometry, outperforming Muon and achieving SOTA among spectral methods on modded-nanoGPT.

FOGO: Forgetting-aware Orthogonalization Optimizer

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

FOGO introduces spectral orthogonalization of momentum updates plus a random-projection codebook memory to detect and correct gradient interference, improving convergence and retention over Adam and Muon on imbalanced, continual, and large-model tasks.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling in training FLOPs and saturating exponential scaling at test time, which improves quality over fixed-depth baselines under fixed parameter budgets.

Muon Learns More Robust and Transferable Features than Adam

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.

Convergence of Spectral Descent for Non-smooth Optimization

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust low-rank matrix recovery.
