From SGD to Muon: Adaptive Optimization via Schatten-p Norms

Corentin Friedrich; DTIPG - SNCF; Mathieu Serrurier (IRIT); Thomas Massena (IRIT; UT3)

arxiv: 2605.19781 · v1 · pith:PVWGRL77new · submitted 2026-05-19 · 💻 cs.AI


Thomas Massena (IRIT, DTIPG - SNCF, UT3), Corentin Friedrich, Mathieu Serrurier (IRIT)

Pith reviewed 2026-05-20 06:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords adaptive optimization · LMO geometry · Schatten-p norms · Muon optimizer · data-driven criterion · gradient statistics · neural network training · random feature regression


The pith

A data-driven criterion selects proxy-optimal LMO geometries per neural network layer by interpolating between SGD and Muon using gradient and activation statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a closed-form method to pick update geometries dynamically for each layer instead of fixing them in advance. A single-step random feature regression surrogate model computes this choice from runtime gradient and activation data, creating an optimizer that moves continuously from SGD-style to Muon-style updates. Adding parameter-wise preconditioning recovers Adam and MuAdam as boundary cases while keeping overhead near three percent. Experiments show the resulting adaptive optimizer matches or exceeds the better of static Muon and AdamW across three training setups. The work demonstrates that LMO geometry can be treated as a learnable choice rather than a fixed design decision.

Core claim

The central claim is that a closed-form criterion derived from a single-step random feature regression surrogate model can select proxy-optimal LMO geometries for individual DNN layers based on gradient and activation statistics, allowing an adaptive optimizer that interpolates between SGD and Muon updates and recovers SGD, Muon, Adam, and MuAdam as special cases when combined with parameter-wise preconditioning, while incurring only modest runtime overhead and delivering competitive performance.

What carries the argument

The data-driven criterion for choosing LMO geometry, derived in closed form from gradient and activation statistics via a single-step random feature regression surrogate model that interpolates across Schatten-p norms.
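For orientation, the standard linear minimization oracle over a Schatten-p unit ball has a closed form in terms of the SVD of the gradient. The block below states that textbook identity; it is not the paper's criterion, which the abstract does not spell out. The symbols G, U, V, sigma, and q are notation assumed here, and whether the paper parametrizes its design space exactly this way is also an assumption.

```latex
% Assumed notation: G = U \operatorname{diag}(\sigma) V^\top is the SVD of a layer's gradient,
% and q is the Hölder conjugate of p, i.e. 1/p + 1/q = 1.
\[
\mathrm{LMO}_p(G)
  \;=\; \operatorname*{arg\,min}_{\|X\|_{S_p} \le 1} \langle G, X \rangle
  \;=\; -\,U \operatorname{diag}\!\left( \frac{\sigma_i^{\,q-1}}{\|\sigma\|_q^{\,q-1}} \right) V^\top .
\]
% p = 2 (q = 2): \mathrm{LMO}_2(G) = -G / \|G\|_F, a normalized SGD direction.
% p \to \infty (q \to 1): \mathrm{LMO}_\infty(G) = -U V^\top, the orthogonalized direction used by Muon.
```

Read this way, intermediate values of p flatten the singular value spectrum of the update only partially, which is one natural sense in which a design space can run from SGD to Muon.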

If this is right

The framework unifies SGD, Muon, Adam, and MuAdam as extrema of the same interpolation.
Runtime overhead remains near three percent relative to highly optimized baselines.
Performance is at least as good as the stronger of Muon or AdamW in the tested scenarios.
LMO geometry selection becomes a runtime data-driven decision rather than a static design choice.
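The two matrix-wise endpoints of that interpolation can be illustrated on singular values alone. The sketch below applies the standard closed-form LMO for a Schatten-p unit ball to a list of singular values, using only the Python standard library. The function name `schatten_lmo_spectrum` is hypothetical, the sign and the singular vectors are omitted, and the paper's actual rule for choosing p per layer is not reproduced because the abstract does not give it.

```python
import math


def schatten_lmo_spectrum(singular_values, p):
    """Map gradient singular values to those of the Schatten-p LMO direction.

    Hypothetical helper for illustration only. It implements the textbook
    formula x_i = (s_i / ||s||_q)^(q - 1) with 1/p + 1/q = 1, for p > 1.
    The sign flip and the singular vectors U, V are left out.
    """
    if not p > 1:
        raise ValueError("this sketch assumes p > 1")
    s = [float(x) for x in singular_values]
    if any(x < 0 for x in s):
        raise ValueError("singular values must be non-negative")
    if not any(s):
        # Zero gradient: every feasible point is optimal, return zeros.
        return [0.0] * len(s)
    if math.isinf(p):
        # q = 1: every nonzero singular value becomes 1 (Muon-style orthogonalization).
        return [1.0 if x > 0 else 0.0 for x in s]
    q = p / (p - 1.0)  # Hölder conjugate exponent
    norm_q = sum(x ** q for x in s) ** (1.0 / q)
    return [(x / norm_q) ** (q - 1.0) for x in s]


# p = 2 rescales the spectrum by its Euclidean norm (normalized SGD direction).
sgd_like = schatten_lmo_spectrum([3.0, 4.0], 2)
# p = infinity flattens the spectrum completely (Muon-style direction).
muon_like = schatten_lmo_spectrum([3.0, 4.0], math.inf)
# An intermediate p flattens it only partially.
in_between = schatten_lmo_spectrum([3.0, 4.0], 4)
```

In a full optimizer the returned values would be recombined with the singular vectors of the gradient, and large-scale implementations typically avoid an explicit SVD; both steps are outside the scope of this sketch.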

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same surrogate approach could be tested on non-vision tasks where layer geometries differ sharply.
Continuous interpolation over Schatten-p norms might allow smooth scheduling of geometry during a single training run.
If the criterion generalizes, similar data-driven selection could be applied to other matrix-norm constrained optimizers outside the current design space.

Load-bearing premise

The single-step random feature regression surrogate model accurately derives a proxy-optimal LMO geometry choice from gradient and activation statistics for individual layers.

What would settle it

Training runs in which the adaptive optimizer underperforms both fixed Muon and fixed AdamW across all three reported scenarios would show that the data-driven geometry selection does not deliver the claimed performance benefit.

Original abstract

Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by-design or empirically, which are not necessarily optimal according to the problem's geometry. We introduce a novel efficient datadriven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim$ 3% runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the performance of the best performing optimizer between Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a data-driven way to pick per-layer LMO geometries with a random-feature surrogate, but only the abstract was available for this review, so the derivation and results cannot be checked.


This paper's main move is to replace fixed LMO geometries with a runtime choice per layer. It derives a closed-form criterion from gradient and activation statistics through a single-step random feature regression, then uses that to interpolate between SGD-style and Muon-style updates. Adding parameter-wise preconditioning lets the same setup recover Adam and MuAdam as special cases. The abstract also reports roughly 3% overhead and claims the resulting optimizer matches or beats the better of Muon and AdamW in three training scenarios. That unification and the low-cost adaptation are the concrete contributions on offer.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a data-driven adaptive optimizer that dynamically selects proxy-optimal Linear Minimization Oracle (LMO) geometries per layer by deriving a closed-form criterion from gradient and activation statistics via a single-step random feature regression surrogate model. The framework interpolates between SGD and Muon updates, recovers SGD, Muon, Adam, and MuAdam as special cases when combined with parameter-wise preconditioning, incurs only ~3% runtime overhead, and is claimed to beat or match the best of Muon and AdamW across three unspecified training scenarios.

Significance. If the central claims hold, the work would provide evidence that LMO geometry can be successfully adapted from runtime data, offering a new pathway for optimizer design beyond static choices. The unification under LMO theory, the closed-form derivation, and the recovery of existing methods as extrema are conceptual strengths that could influence future adaptive optimization research.

major comments (1)

Abstract: the performance claim that the adaptive optimizer 'beats or remains competitive with the best performing optimizer between Muon and AdamW' rests on an unexamined surrogate model and three unspecified scenarios; without the explicit derivation of the closed-form criterion or experimental protocols, it is impossible to assess whether the single-step random feature regression surrogate accurately grounds the proxy-optimal LMO choice or introduces circularity.

minor comments (2)

Abstract: 'datadriven' should be written as 'data-driven'.
Abstract: the three training scenarios are not identified, which limits evaluation of the generality of the competitive performance result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below.


Referee: [—] Abstract: the performance claim that the adaptive optimizer 'beats or remains competitive with the best performing optimizer between Muon and AdamW' rests on an unexamined surrogate model and three unspecified scenarios; without the explicit derivation of the closed-form criterion or experimental protocols, it is impossible to assess whether the single-step random feature regression surrogate accurately grounds the proxy-optimal LMO choice or introduces circularity.

Authors: We appreciate the referee raising this point regarding the abstract. The manuscript provides the explicit derivation of the closed-form criterion in the main text, obtained from the single-step random feature regression surrogate model applied to per-layer gradient and activation statistics. This choice is proxy-optimal in the sense that it selects the LMO geometry that minimizes a surrogate objective for the update direction. There is no circularity because the statistics are computed from the current or previous step to inform the geometry for the next update. The three training scenarios are detailed in the Experiments section, where direct comparisons to Muon and AdamW are performed across these setups, showing the adaptive method matches or exceeds the best of the two. We can revise the abstract to be more specific about the scenarios if space permits, or ensure the main text makes the protocols clearer.

revision: partial

Circularity Check

0 steps flagged

No circularity detectable; derivation chain unverifiable from abstract alone


Only the abstract is available, which states that the criterion is 'Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model' and that the framework 'recovers SGD, Muon, Adam, and MuAdam as specific extrema.' No equations, derivations, self-citations, or uniqueness theorems are provided in the text. Without specific load-bearing steps or reductions that can be quoted and shown to equal inputs by construction, no circularity of any enumerated kind can be exhibited. The abstract presents the method as an independent data-driven interpolation over a design space, which is consistent with a self-contained derivation rather than a renaming or fitted-input prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive audit; the approach assumes LMO theory unifies matrix constraints and that gradient/activation statistics suffice for proxy-optimal geometry selection.

axioms (1)

domain assumption: Linear Minimization Oracle (LMO) theory unifies matrix-wise geometry constraints on optimizer updates
Invoked to frame all current methods and the new adaptive criterion.

pith-pipeline@v0.9.0 · 5741 in / 1150 out tokens · 52897 ms · 2026-05-20T06:13:43.111627+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.