Recognition: 2 theorem links
· Lean TheoremConvergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning
Pith reviewed 2026-05-16 15:33 UTC · model grok-4.3
The pith
AdamW-style Shampoo unifies one-sided and two-sided preconditioning to achieve an SGD-like convergence rate in nuclear norm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate 1 over K sum from k equals 1 to K of E of the nuclear norm of nabla f of X sub k is less than or equal to order sqrt of m plus n times C over K to the power 1/4. The paper notes that the nuclear norm and Frobenius norm satisfy the sandwich inequality with the sqrt(m plus n) factor, supporting that the rate is analogous to the optimal SGD rate of order C over K to the power 1/4 when the nuclear norm is Theta of sqrt(m plus n) times the Frobenius norm.
What carries the argument
The AdamW-style Shampoo preconditioner, a specific matrix preconditioning form applied to stochastic gradients that combines elements of AdamW weight decay with Shampoo-style updates.
Load-bearing premise
The analysis assumes the objective satisfies standard stochastic optimization conditions such as smoothness and bounded gradient variance.
What would settle it
An experiment in which the average nuclear norm of gradients over many iterations decreases slower than order 1 over K to the power 1/4 for large K would contradict the claimed bound.
read the original abstract
This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}})$ measured by nuclear norm, where $K$ represents the iteration number, $(m,n)$ denotes the size of matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\|\nabla f(X)\|_F$, supporting that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum_{k=1}^KE\left[\|\nabla f(X_k)\|_F\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case of $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\|\nabla f(X)\|_F$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes the AdamW-style Shampoo optimizer, unifying one-sided and two-sided preconditioning for matrix parameters. It derives the convergence rate (1/K) sum_{k=1}^K E[||∇f(X_k)||_*] ≤ O(√(m+n) C / K^{1/4}) in the nuclear norm, where C matches the constant from the optimal SGD rate under standard stochastic non-convex assumptions (smoothness and bounded variance). The bound follows from the descent lemma applied to the preconditioned update together with the elementary norm relation ||G||_F ≤ ||G||_* ≤ √(m+n) ||G||_F.
Significance. If the derivation is correct, the result supplies a clean theoretical explanation for the empirical performance of Shampoo (winner of the external tuning track in AlgoPerf) by showing that its rate is essentially the SGD rate scaled by the natural √(m+n) factor arising from the nuclear-to-Frobenius norm comparison. The unification of one- and two-sided preconditioners under a common AdamW-style form is a modest but useful organizational contribution to the literature on adaptive second-order methods.
minor comments (3)
- [§2] §2 (or wherever the precise update rule is stated): the AdamW-style modification (weight decay placement relative to the preconditioner) should be written explicitly as an equation so that the reader can verify it matches the one-sided/two-sided unification claim without ambiguity.
- [Introduction] The abstract and introduction both invoke the relation ||∇f||_* = Θ(√(m+n)) ||∇f||_F to claim analogy with SGD; a short remark on when this scaling is tight (e.g., dense vs. low-rank gradients) would strengthen the interpretation.
- [Theorem 1] The proof sketch in the main text relies on the standard descent lemma and variance bound; adding a single sentence that lists the exact assumptions (L-smoothness, σ²-bounded variance, etc.) would make the result self-contained.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The summary accurately reflects the paper's contributions on unifying one- and two-sided preconditioning and deriving the nuclear-norm convergence rate analogous to SGD.
read point-by-point responses
-
Referee: The manuscript analyzes the AdamW-style Shampoo optimizer, unifying one-sided and two-sided preconditioning for matrix parameters. It derives the convergence rate (1/K) sum_{k=1}^K E[||∇f(X_k)||_*] ≤ O(√(m+n) C / K^{1/4}) in the nuclear norm, where C matches the constant from the optimal SGD rate under standard stochastic non-convex assumptions (smoothness and bounded variance). The bound follows from the descent lemma applied to the preconditioned update together with the elementary norm relation ||G||_F ≤ ||G||_* ≤ √(m+n) ||G||_F.
Authors: We appreciate the referee's precise summary of the main result. The proof indeed proceeds by applying the standard descent lemma to the AdamW-style update (which incorporates the preconditioner in the stated form) and then invoking the elementary relation between Frobenius and nuclear norms to obtain the √(m+n) factor that aligns the bound with the optimal SGD rate under the same assumptions. revision: no
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper's central convergence bound is obtained by applying the standard descent lemma and variance bound from non-convex stochastic optimization to the AdamW-style preconditioner, then invoking the elementary norm relation ||G||_F ≤ ||G||_* ≤ √(m+n) ||G||_F. The constant C is taken directly from the known SGD rate under matching smoothness and variance assumptions, with no fitting of parameters to the target result, no self-definitional steps, and no load-bearing self-citations that reduce the claim to its own inputs. The unification of one- and two-sided preconditioning occurs via an explicit common update form that does not presuppose the nuclear-norm rate. The analysis is therefore independent of the claimed output and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- C
axioms (1)
- domain assumption Standard assumptions of stochastic optimization (smoothness, bounded variance or equivalent)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 2.1 ... 1/K sum E[||nabla f(X_k)||_*] <= O(sqrt(m+n) C / K^{1/4}) ... Schatten-p Holder inequality (Lemma 3.2)
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_injective unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lemma 3.5 (matrix Cauchy-Schwarz) ... L^alpha M R^{1-alpha} op bound
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices
SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.