arxiv: 2601.07326 · v2 · submitted 2026-01-12 · 🧮 math.OC · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

Huan Li , Yiming Dong , Zhouchen Lin

Authors on Pith no claims yet

Pith reviewed 2026-05-16 15:33 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords Shampoo optimizerAdamWconvergence ratepreconditioningstochastic optimizationnuclear normmatrix parameters

0 comments

The pith

AdamW-style Shampoo unifies one-sided and two-sided preconditioning to achieve an SGD-like convergence rate in nuclear norm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a convergence bound for the AdamW-style Shampoo optimizer showing that the average expected nuclear norm of the gradient over K steps is at most order sqrt(m plus n) times C over K to the power 1/4, where m and n are the dimensions of the matrix parameter. This unifies earlier separate treatments of one-sided and two-sided preconditioners under a single analysis. A reader would care because the bound matches the form of the optimal SGD rate when the nuclear norm and Frobenius norm are comparable, offering theoretical support for an optimizer that performed strongly in external tuning competitions. The paper also recalls the norm relation that the Frobenius norm is at most the nuclear norm which is at most sqrt(m plus n) times the Frobenius norm, so the result is consistent with known SGD behavior in the ideal case.

Core claim

Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate 1 over K sum from k equals 1 to K of E of the nuclear norm of nabla f of X sub k is less than or equal to order sqrt of m plus n times C over K to the power 1/4. The paper notes that the nuclear norm and Frobenius norm satisfy the sandwich inequality with the sqrt(m plus n) factor, supporting that the rate is analogous to the optimal SGD rate of order C over K to the power 1/4 when the nuclear norm is Theta of sqrt(m plus n) times the Frobenius norm.

What carries the argument

The AdamW-style Shampoo preconditioner, a specific matrix preconditioning form applied to stochastic gradients that combines elements of AdamW weight decay with Shampoo-style updates.

Load-bearing premise

The analysis assumes the objective satisfies standard stochastic optimization conditions such as smoothness and bounded gradient variance.

What would settle it

An experiment in which the average nuclear norm of gradients over many iterations decreases slower than order 1 over K to the power 1/4 for large K would contradict the claimed bound.

read the original abstract

This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}})$ measured by nuclear norm, where $K$ represents the iteration number, $(m,n)$ denotes the size of matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\|\nabla f(X)\|_F$, supporting that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum_{k=1}^KE\left[\|\nabla f(X_k)\|_F\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case of $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\|\nabla f(X)\|_F$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes the AdamW-style Shampoo optimizer, unifying one-sided and two-sided preconditioning for matrix parameters. It derives the convergence rate (1/K) sum_{k=1}^K E[||∇f(X_k)||_*] ≤ O(√(m+n) C / K^{1/4}) in the nuclear norm, where C matches the constant from the optimal SGD rate under standard stochastic non-convex assumptions (smoothness and bounded variance). The bound follows from the descent lemma applied to the preconditioned update together with the elementary norm relation ||G||_F ≤ ||G||_* ≤ √(m+n) ||G||_F.

Significance. If the derivation is correct, the result supplies a clean theoretical explanation for the empirical performance of Shampoo (winner of the external tuning track in AlgoPerf) by showing that its rate is essentially the SGD rate scaled by the natural √(m+n) factor arising from the nuclear-to-Frobenius norm comparison. The unification of one- and two-sided preconditioners under a common AdamW-style form is a modest but useful organizational contribution to the literature on adaptive second-order methods.

minor comments (3)

[§2] §2 (or wherever the precise update rule is stated): the AdamW-style modification (weight decay placement relative to the preconditioner) should be written explicitly as an equation so that the reader can verify it matches the one-sided/two-sided unification claim without ambiguity.
[Introduction] The abstract and introduction both invoke the relation ||∇f||_* = Θ(√(m+n)) ||∇f||_F to claim analogy with SGD; a short remark on when this scaling is tight (e.g., dense vs. low-rank gradients) would strengthen the interpretation.
[Theorem 1] The proof sketch in the main text relies on the standard descent lemma and variance bound; adding a single sentence that lists the exact assumptions (L-smoothness, σ²-bounded variance, etc.) would make the result self-contained.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The summary accurately reflects the paper's contributions on unifying one- and two-sided preconditioning and deriving the nuclear-norm convergence rate analogous to SGD.

read point-by-point responses

Referee: The manuscript analyzes the AdamW-style Shampoo optimizer, unifying one-sided and two-sided preconditioning for matrix parameters. It derives the convergence rate (1/K) sum_{k=1}^K E[||∇f(X_k)||_*] ≤ O(√(m+n) C / K^{1/4}) in the nuclear norm, where C matches the constant from the optimal SGD rate under standard stochastic non-convex assumptions (smoothness and bounded variance). The bound follows from the descent lemma applied to the preconditioned update together with the elementary norm relation ||G||_F ≤ ||G||_* ≤ √(m+n) ||G||_F.

Authors: We appreciate the referee's precise summary of the main result. The proof indeed proceeds by applying the standard descent lemma to the AdamW-style update (which incorporates the preconditioner in the stated form) and then invoking the elementary relation between Frobenius and nuclear norms to obtain the √(m+n) factor that aligns the bound with the optimal SGD rate under the same assumptions. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's central convergence bound is obtained by applying the standard descent lemma and variance bound from non-convex stochastic optimization to the AdamW-style preconditioner, then invoking the elementary norm relation ||G||_F ≤ ||G||_* ≤ √(m+n) ||G||_F. The constant C is taken directly from the known SGD rate under matching smoothness and variance assumptions, with no fitting of parameters to the target result, no self-definitional steps, and no load-bearing self-citations that reduce the claim to its own inputs. The unification of one- and two-sided preconditioning occurs via an explicit common update form that does not presuppose the nuclear-norm rate. The analysis is therefore independent of the claimed output and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The analysis rests on standard stochastic-optimization assumptions whose precise statements are not supplied in the abstract; C is carried over from the SGD literature rather than fitted here.

free parameters (1)

C
Constant taken from the optimal SGD convergence rate; appears in the stated bound.

axioms (1)

domain assumption Standard assumptions of stochastic optimization (smoothness, bounded variance or equivalent)
Required for any convergence proof of this type; invoked implicitly to obtain the rate.

pith-pipeline@v0.9.0 · 5538 in / 1312 out tokens · 32885 ms · 2026-05-16T15:33:28.762115+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 2.1 ... 1/K sum E[||nabla f(X_k)||_*] <= O(sqrt(m+n) C / K^{1/4}) ... Schatten-p Holder inequality (Lemma 3.2)
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 3.5 (matrix Cauchy-Schwarz) ... L^alpha M R^{1-alpha} op bound

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices
math.OC 2026-04 unverdicted novelty 7.0

SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.