Recognition: no theorem link
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
Pith reviewed 2026-05-11 01:26 UTC · model grok-4.3
The pith
PolarAdamW isolates spectral control from Schur gauge-equivariance to test their separate effects in matrix optimization
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon applies a Newton-Schulz polar iteration that controls the spectral norm of the update while remaining equivariant under Schur gauge transformations on multiplicity matrices. PolarAdamW instead applies the same polar map to the direction obtained after AdamW's coordinatewise preconditioner, retaining spectral control but losing gauge-equivariance. The paper proves that the polar step is Schur gauge-equivariant while AdamW's step is not. On standard transformer training the hybrid yields higher accuracy than either parent method; on equivariant geometric regression the full Muon method is superior, establishing a double dissociation between the two mechanisms.
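The claimed dissociation rests on a linear-algebra fact that is easy to check numerically: the polar map commutes with two-sided orthogonal basis changes, while a coordinatewise preconditioner does not. A minimal sketch, using an exact SVD-based polar factor and a one-step second-moment estimate as stand-ins; the function names and the simplified `adamw_dir` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def polar(M):
    # Exact polar factor U V^T via SVD (Muon approximates this with Newton-Schulz).
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def adamw_dir(G):
    # Coordinatewise AdamW-style direction with a one-step second-moment
    # estimate v = G**2; the elementwise division is what makes it basis-dependent.
    return G / (np.sqrt(G**2) + 1e-8)

G = rng.standard_normal((5, 5))                    # momentum / gradient matrix
QL, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # left orthogonal gauge change
QR, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # right orthogonal gauge change

# Equivariant: transforming the input transforms the output the same way.
print(np.allclose(polar(QL @ G @ QR.T), QL @ polar(G) @ QR.T))           # True

# Not equivariant: the coordinatewise step depends on the chosen basis.
print(np.allclose(adamw_dir(QL @ G @ QR.T), QL @ adamw_dir(G) @ QR.T))   # False
```

The same check motivates the paper's design: inserting the elementwise step before the polar map keeps the spectral-norm control of the final update but destroys the equivariance property just verified.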
What carries the argument
PolarAdamW, the hybrid update that inserts AdamW's coordinatewise preconditioner before Muon's Newton-Schulz polar map, thereby preserving spectral-norm control while breaking Schur gauge-equivariance.
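Taking that description at face value, one PolarAdamW step can be sketched as follows. This is a minimal reconstruction under stated assumptions: bias correction is omitted, the plain cubic Newton-Schulz iteration stands in for Muon's tuned quintic coefficients, and the hyperparameter names and decoupled weight-decay placement are our guesses, not the paper's:

```python
import numpy as np

def polar_adamw_step(W, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, wd=0.01, ns_steps=5):
    """One hypothetical PolarAdamW update: AdamW's coordinatewise
    preconditioner first, then Muon's Newton-Schulz polar map applied to
    the resulting direction (rather than to raw momentum)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (AdamW)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (AdamW)
    D = m / (np.sqrt(v) + eps)                # coordinatewise preconditioned direction
    X = D / (np.linalg.norm(D, ord=2) + eps)  # scale spectral norm to 1 for Newton-Schulz
    for _ in range(ns_steps):                 # approximate polar factor of D (Muon)
        X = 1.5 * X - 0.5 * X @ X.T @ X
    W = W - lr * X - lr * wd * W              # decoupled weight decay (AdamW-style)
    return W, m, v
```

Because the cubic iteration keeps every singular value of `X` in [0, 1], the applied update has spectral norm at most `lr`, which is the spectral control the hybrid inherits from Muon.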
If this is right
- Spectral control from the polar map can be usefully combined with AdamW-style preconditioning on standard transformer architectures.
- Schur gauge-equivariance supplies additional benefit when the architecture has non-trivial multiplicity-basis freedom, as in SO(3)-equivariant models.
- Both polar-based methods continue to outperform plain AdamW on the tested tasks.
- The hybrid PolarAdamW runs at wall-clock cost comparable to Muon.
Where Pith is reading between the lines
- The hybridization technique could be applied to other existing optimizers to isolate the contribution of gauge symmetries.
- Designers of geometric or physics-informed networks may gain from choosing optimizers that match the symmetry group of the model.
- Future work could make the degree of gauge-equivariance in the optimizer adaptive to the current layer or training phase.
Load-bearing premise
The observed accuracy gaps between PolarAdamW and Muon are caused specifically by the presence or absence of Schur gauge-equivariance rather than by other algorithmic details or experimental factors.
What would settle it
Re-running the SO(3) point-cloud regression with PolarAdamW and finding that it matches or exceeds Muon at the audited capacities, or finding that the advantage of PolarAdamW over Muon disappears on the DeiT subsets under identical random seeds and tuning, would falsify the dissociation.
Original abstract
Muon's matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon's polar spectral-norm control but breaks the gauge-equivariance, since AdamW's coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon's Newton-Schulz polar map to AdamW's preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon's polar step is Schur gauge-equivariant on multiplicity matrices while AdamW's coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PolarAdamW as a hybrid optimizer that applies Muon's Newton-Schulz polar map to AdamW-preconditioned directions, thereby retaining spectral-norm control while breaking Schur gauge-equivariance. It asserts a proof that Muon's polar step is Schur gauge-equivariant on multiplicity matrices whereas AdamW's coordinatewise preconditioner is not. Experiments on DeiT-Tiny trained from scratch on four 100-class ImageNet-1k subsets show PolarAdamW outperforming Muon by +1.93 pp (and AdamW by +9.5 pp) on average; under a 300-epoch recipe the gaps are +1.37 pp and +5.80 pp. On SO(3)-equivariant 3D point-cloud regression the ordering reverses, with Muon ahead of PolarAdamW at every audited capacity (both still ahead of AdamW). The authors interpret this double dissociation as separating spectral control from gauge-equivariance.
Significance. If the claimed separation can be rigorously established, the work clarifies distinct mechanistic contributions within matrix optimizers and may guide selective incorporation of gauge-equivariance for tasks with nontrivial multiplicity freedom. The algorithmic construction at wall-time comparable to Muon is a practical strength, and the explicit contrast between trivial and non-trivial multiplicity regimes is a useful framing.
major comments (3)
- [Abstract and theoretical analysis] Abstract and theoretical section: the proof that Muon's polar step is Schur gauge-equivariant while AdamW's coordinatewise step is not is asserted without any derivation, lemmas, or intermediate equations, so the central mathematical claim that PolarAdamW cleanly isolates the two effects cannot be verified from the provided material.
- [Experimental evaluation] Experimental evaluation (DeiT-Tiny and SO(3) sections): the double dissociation attributes the performance reversal specifically to the presence/absence of Schur gauge-equivariance, yet the two tasks differ simultaneously in architecture (transformer vs. equivariant network), objective, data distribution, capacity scaling, and training length; no within-task ablation varies only the multiplicity-basis freedom, so the observed ordering cannot be unambiguously linked to gauge-equivariance rather than other interactions.
- [Experimental evaluation] Reported accuracy gaps (e.g., +1.93 pp and +1.37 pp averages): no error bars, standard deviations across the four subsets, or statistical significance tests are supplied, weakening the quantitative support for the claimed superiority and reversal.
minor comments (1)
- [Algorithmic definition] The algorithmic description of PolarAdamW would be clearer with explicit pseudocode or a step-by-step listing of the update rule, especially the precise point at which the polar map is applied to the preconditioned direction.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and describe the changes we will make in the revised version.
Point-by-point responses
- Referee: [Abstract and theoretical analysis] Abstract and theoretical section: the proof that Muon's polar step is Schur gauge-equivariant while AdamW's coordinatewise step is not is asserted without any derivation, lemmas, or intermediate equations, so the central mathematical claim that PolarAdamW cleanly isolates the two effects cannot be verified from the provided material.
  Authors: We agree that the proof requires more detailed exposition to allow verification. In the revised manuscript, we will include a complete derivation of the Schur gauge-equivariance property for the polar step, supported by appropriate lemmas and intermediate equations. This will explicitly show why Muon's polar map is equivariant on multiplicity matrices and why AdamW's preconditioner breaks this property, thereby clarifying how PolarAdamW isolates the spectral control effect. revision: yes
- Referee: [Experimental evaluation] Experimental evaluation (DeiT-Tiny and SO(3) sections): the double dissociation attributes the performance reversal specifically to the presence/absence of Schur gauge-equivariance, yet the two tasks differ simultaneously in architecture (transformer vs. equivariant network), objective, data distribution, capacity scaling, and training length; no within-task ablation varies only the multiplicity-basis freedom, so the observed ordering cannot be unambiguously linked to gauge-equivariance rather than other interactions.
  Authors: We recognize that the experimental setups differ in multiple dimensions, which limits the ability to isolate gauge-equivariance as the sole causal factor. Our intention was to demonstrate the effect in regimes where multiplicity-basis freedom is trivial versus non-trivial by construction. We will revise the manuscript to include an expanded discussion of these design choices and their implications for interpreting the double dissociation. While a controlled within-task ablation would be desirable, it would require fundamentally different experimental setups that may not preserve the original task characteristics; we therefore acknowledge this as a limitation in the revised text. revision: partial
- Referee: [Experimental evaluation] Reported accuracy gaps (e.g., +1.93 pp and +1.37 pp averages): no error bars, standard deviations across the four subsets, or statistical significance tests are supplied, weakening the quantitative support for the claimed superiority and reversal.
  Authors: We concur that the absence of variability measures and statistical tests weakens the quantitative claims. In the revision, we will add standard deviations across the four subsets for the DeiT-Tiny results, include error bars in the relevant figures, and report statistical significance tests (such as t-tests) for the performance differences. Similar updates will be made for the SO(3) experiments where multiple runs are available. revision: yes
Circularity Check
No significant circularity: direct algorithmic hybrid and independent mathematical proof
Full rationale
The paper's central construction defines PolarAdamW procedurally as applying the Newton-Schulz polar map (from Muon) to AdamW's preconditioned direction. This is an explicit combination of two existing algorithms with no data-dependent fitting, no parameter estimation from target quantities, and no reduction of the claimed separation to prior fitted results. The gauge-equivariance proof is presented as a standalone mathematical argument on multiplicity matrices. Empirical comparisons are reported on distinct tasks without any 'prediction' step that reuses fitted inputs. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing justifications. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Newton-Schulz iteration produces the polar factor of a matrix
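This axiom is easy to sanity-check numerically: for M = Q1 D Q2^T with positive diagonal D, the polar factor is Q1 Q2^T, and the cubic Newton-Schulz iteration recovers it once the start is scaled so every singular value lies in (0, √3). A minimal check, with the plain cubic iteration standing in for Muon's tuned variant:

```python
import numpy as np

def newton_schulz_polar(M, steps=30):
    # Cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X converges to the polar
    # factor of M when the scaled start has singular values in (0, sqrt(3)).
    X = M / np.linalg.norm(M, ord=2)  # scale spectral norm to 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = Q1 @ np.diag([2.0, 1.0, 0.5, 0.1]) @ Q2.T  # known polar factor: Q1 @ Q2.T

P = newton_schulz_polar(M)
print(np.allclose(P, Q1 @ Q2.T, atol=1e-6))  # True: iteration recovers the polar factor
```

The iteration drives each singular value toward 1 while leaving the singular subspaces fixed, which is exactly why the resulting update has unit spectral norm.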