Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices
Pith reviewed 2026-05-09 21:19 UTC · model grok-4.3
The pith
The SOAP optimizer receives its first convergence-rate proof, one that holds even for arbitrary orthogonal projection matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish, for the first time, the convergence rate of SOAP. Our analysis extends to a more general variant of SOAP that admits arbitrary orthogonal projection matrices and requires only that these matrices be conditionally independent of the current stochastic gradient at each iteration. For example, they may be constructed from information available up to the preceding step.
What carries the argument
Arbitrary orthogonal projection matrices, required only to be conditionally independent of the current stochastic gradient at each iteration.
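Read formally (our notation, not quoted from the note): at step $t$ the orthogonal matrices must be measurable with respect to the history up to step $t-1$, while the stochastic gradient is drawn fresh. A hedged schematic of the SOAP-style update this licenses:

```latex
% Schematic of the analyzed update (notation assumed, not taken from the note).
% Q_t, P_t are orthogonal and built only from information up to step t-1,
% hence conditionally independent of the fresh stochastic gradient G_t.
\begin{align*}
  \tilde{G}_t &= Q_t^\top G_t P_t
    && \text{rotate the gradient into the chosen basis,} \\
  W_{t+1}     &= W_t - \eta_t \, Q_t \,
                 \mathrm{Adam}\!\left(\tilde{G}_1, \dots, \tilde{G}_t\right) P_t^\top
    && \text{take the Adam step there, rotate it back.}
\end{align*}
```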
If this is right
- Any orthogonal projection matrix built solely from past information inherits the same convergence guarantee (see the sketch after this list).
- The original SOAP algorithm is recovered as a special case of the analyzed variant.
- Convergence holds without requiring the projections to be chosen from a fixed finite set or to satisfy stronger independence conditions.
- The proof technique applies directly to other matrix-based first-order methods that employ similar projections.
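A minimal sketch of such a variant, assuming the usual SOAP-style structure (Adam moments in a rotated basis, eigenbases from EMA factor statistics). All names, the EMA construction, and the omission of bias correction and of periodic (rather than per-step) eigendecomposition are our simplifications, not the note's algorithm:

```python
import numpy as np

def init_state(m, n):
    """Hypothetical state for an (m x n) weight matrix: Adam moments,
    EMA factor statistics, and eigenbases built from PAST gradients only."""
    return {"m": np.zeros((m, n)), "v": np.zeros((m, n)),
            "L": np.eye(m), "R": np.eye(n),
            "QL": np.eye(m), "QR": np.eye(n)}

def soap_variant_step(W, grad_fn, state, lr=1e-3, beta1=0.9, beta2=0.999,
                      stat_beta=0.95, eps=1e-8):
    """One step of a SOAP-style update whose projections use only past info."""
    QL, QR = state["QL"], state["QR"]   # fixed BEFORE this step's gradient exists
    G = grad_fn(W)                      # fresh stochastic gradient

    Gt = QL.T @ G @ QR                  # rotate the gradient into the basis
    state["m"] = beta1 * state["m"] + (1 - beta1) * Gt
    state["v"] = beta2 * state["v"] + (1 - beta2) * Gt ** 2
    step = state["m"] / (np.sqrt(state["v"]) + eps)  # Adam step in the basis
    W = W - lr * (QL @ step @ QR.T)     # rotate the step back

    # Only now fold the current gradient into the factor statistics; the
    # refreshed eigenbases are legal projections for the NEXT step, which is
    # exactly the "information up to the preceding step" condition.
    state["L"] = stat_beta * state["L"] + (1 - stat_beta) * G @ G.T
    state["R"] = stat_beta * state["R"] + (1 - stat_beta) * G.T @ G
    state["QL"] = np.linalg.eigh(state["L"])[1]
    state["QR"] = np.linalg.eigh(state["R"])[1]
    return W, state
```

Swapping `eigh` for any other rule that consumes only state from previous steps leaves the measurability condition intact, which is the point of the generalization.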
Where Pith is reading between the lines
- Implementers can safely incorporate historical gradient or curvature information when forming the projections without losing the convergence guarantee.
- The conditional-independence lens may be useful for analyzing other adaptive matrix optimizers that reuse earlier computations.
- Empirical checks could test whether common practical choices of projection matrices already satisfy the stated independence condition in typical training runs.
Load-bearing premise
The projection matrices must remain conditionally independent of the stochastic gradient computed at the same iteration.
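One standard way such a premise does its work in convergence proofs (our reconstruction; the note's actual argument may differ): measurability lets the projections pass through conditional expectations, so the rotated gradient stays an unbiased estimate of the rotated true gradient, and orthogonality preserves its norm.

```latex
% Assuming Q_t, P_t are \mathcal{F}_{t-1}-measurable and
% E[G_t | \mathcal{F}_{t-1}] = \nabla f(W_t):
\begin{equation*}
  \mathbb{E}\!\left[\, Q_t^\top G_t P_t \,\middle|\, \mathcal{F}_{t-1} \right]
  = Q_t^\top \, \mathbb{E}\!\left[\, G_t \,\middle|\, \mathcal{F}_{t-1} \right] P_t
  = Q_t^\top \nabla f(W_t) \, P_t ,
  \qquad
  \left\| Q_t^\top G_t P_t \right\|_F = \left\| G_t \right\|_F .
\end{equation*}
```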
What would settle it
A low-dimensional quadratic minimization experiment in which the projection matrix is deliberately made to depend on the current gradient, checking whether the observed convergence then degrades past the proven bound (sketched below).
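A hedged harness for exactly this test (our construction throughout: the quadratic, the adversarial basis, and the Adam-style rotated step are illustrative choices, not the note's setup). The dependent variant builds the basis from the current gradient, violating the premise; whether its convergence then falls short of the proven rate is what the run would check:

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps, lr = 4, 3000, 0.05
A = np.diag([1.0, 2.0, 5.0, 10.0])     # f(w) = 0.5 * w^T A w, minimum at 0

def stoch_grad(w):
    """Unbiased stochastic gradient of the quadratic."""
    return A @ w + 0.1 * rng.standard_normal(d)

def basis_from(g):
    """Orthogonal matrix whose first column aligns with g (via QR)."""
    return np.linalg.qr(np.column_stack([g, np.eye(d)[:, 1:]]))[0]

def run(dependent):
    w = np.ones(d)
    m, v = np.zeros(d), np.zeros(d)    # moment buffers, kept in the current basis
    Q = np.eye(d)                      # initial basis uses no gradient information
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for _ in range(steps):
        g = stoch_grad(w)
        if dependent:
            Q = basis_from(g)          # VIOLATION: basis built from current gradient
        gt = Q.T @ g                   # Adam-style step in the rotated basis
        m = beta1 * m + (1 - beta1) * gt
        v = beta2 * v + (1 - beta2) * gt ** 2
        w = w - lr * Q @ (m / (np.sqrt(v) + eps))
        if not dependent:
            Q = basis_from(g)          # legal: g is now "past" information
    return 0.5 * w @ A @ w

print("conditionally independent Q, final loss:", run(False))
print("gradient-dependent Q, final loss:", run(True))
```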
read the original abstract
In this short note, we establish, for the first time, the convergence rate of SOAP, an efficient and popular matrix-based optimizer for training deep neural networks. Our analysis extends to a more general variant of SOAP that admits arbitrary orthogonal projection matrices and requires only that these matrices be conditionally independent of the current stochastic gradient at each iteration. For example, they may be constructed from information available up to the preceding step.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. In this short note, the authors establish the convergence rate of the SOAP optimizer for the first time. The analysis is extended to a general variant allowing arbitrary orthogonal projection matrices, provided they are conditionally independent of the current stochastic gradient at each iteration. Examples include constructing them from information up to the preceding step.
Significance. Should the convergence-rate derivation prove correct under the conditional-independence assumption, this contribution would be significant, as it provides the first theoretical guarantee for SOAP, a widely used optimizer in deep learning. The generalization to arbitrary projections under only this weak assumption is a notable strength: it requires neither strong independence nor specific constructions, potentially enabling more flexible implementations. The paper's focus on a parameter-free derivation under minimal axioms is commendable.
minor comments (1)
- [Abstract] Consider specifying the exact form of the established convergence rate (e.g., linear, sublinear) to immediately convey the strength of the result to readers.
Simulated Author's Rebuttal
We thank the referee for their positive summary of our manuscript and the recommendation for minor revision. We appreciate the recognition that our work provides the first theoretical convergence guarantee for SOAP under the conditional independence assumption, along with the generalization to arbitrary orthogonal projections.
read point-by-point responses
- Referee: The report provides a positive overall assessment but lists no specific major comments or concerns about the derivation, assumptions, or results.
  Authors: We are pleased that the referee finds the contribution significant and notes the strength of the weak conditional-independence assumption. No specific issues were raised that require rebuttal. We will incorporate any minor suggestions (e.g., clarifications or additional examples) in the revised version while preserving the parameter-free nature of the analysis.
  Revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper is a short theoretical note that derives a convergence rate for SOAP (and its generalization) directly from the explicit assumption that the orthogonal projection matrices are conditionally independent of the current stochastic gradient. The abstract states the result is established 'for the first time' under this minimal condition, with no fitted parameters, no self-citations invoked as load-bearing premises, and no reduction of the claimed rate to a definition or prior result by construction. The derivation chain is therefore self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the orthogonal projection matrices are conditionally independent of the current stochastic gradient at each iteration.