Recognition: no theorem link
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
Pith reviewed 2026-05-11 01:26 UTC · model grok-4.3
The pith
PolarAdamW isolates spectral control from Schur gauge-equivariance to test their separate effects in matrix optimization
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon applies a Newton-Schulz polar iteration that controls the spectral norm of the update while remaining equivariant under Schur gauge transformations on multiplicity matrices. PolarAdamW instead applies the same polar map to the direction obtained after AdamW's coordinatewise preconditioner, retaining spectral control but losing gauge-equivariance. The paper proves that the polar step is Schur gauge-equivariant while AdamW's step is not. On standard transformer training the hybrid yields higher accuracy than either parent method; on equivariant geometric regression the full Muon method is superior, establishing a double dissociation between the two mechanisms.
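The claimed dissociation rests on a linear-algebra fact that is easy to check numerically: the polar map commutes with two-sided orthogonal basis changes, while a coordinatewise preconditioner does not. A minimal sketch, using an exact SVD-based polar factor and a one-step second-moment estimate as stand-ins; the function names and the simplified `adamw_dir` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def polar(M):
    # Exact polar factor U V^T via SVD (Muon approximates this with Newton-Schulz).
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def adamw_dir(G):
    # Coordinatewise AdamW-style direction with a one-step second-moment
    # estimate v = G**2; the elementwise division is what makes it basis-dependent.
    return G / (np.sqrt(G**2) + 1e-8)

G = rng.standard_normal((5, 5))                    # momentum / gradient matrix
QL, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # left orthogonal gauge change
QR, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # right orthogonal gauge change

# Equivariant: transforming the input transforms the output the same way.
print(np.allclose(polar(QL @ G @ QR.T), QL @ polar(G) @ QR.T))           # True

# Not equivariant: the coordinatewise step depends on the chosen basis.
print(np.allclose(adamw_dir(QL @ G @ QR.T), QL @ adamw_dir(G) @ QR.T))   # False
```

The same check motivates the paper's design: inserting the elementwise step before the polar map keeps the spectral-norm control of the final update but destroys the equivariance property just verified.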
What carries the argument
PolarAdamW, the hybrid update that inserts AdamW's coordinatewise preconditioner before Muon's Newton-Schulz polar map, thereby preserving spectral-norm control while breaking Schur gauge-equivariance.
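Taking that description at face value, one PolarAdamW step can be sketched as follows. This is a minimal reconstruction under stated assumptions: bias correction is omitted, the plain cubic Newton-Schulz iteration stands in for Muon's tuned quintic coefficients, and the hyperparameter names and decoupled weight-decay placement are our guesses, not the paper's:

```python
import numpy as np

def polar_adamw_step(W, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, wd=0.01, ns_steps=5):
    """One hypothetical PolarAdamW update: AdamW's coordinatewise
    preconditioner first, then Muon's Newton-Schulz polar map applied to
    the resulting direction (rather than to raw momentum)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (AdamW)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (AdamW)
    D = m / (np.sqrt(v) + eps)                # coordinatewise preconditioned direction
    X = D / (np.linalg.norm(D, ord=2) + eps)  # scale spectral norm to 1 for Newton-Schulz
    for _ in range(ns_steps):                 # approximate polar factor of D (Muon)
        X = 1.5 * X - 0.5 * X @ X.T @ X
    W = W - lr * X - lr * wd * W              # decoupled weight decay (AdamW-style)
    return W, m, v
```

Because the cubic iteration keeps every singular value of `X` in [0, 1], the applied update has spectral norm at most `lr`, which is the spectral control the hybrid inherits from Muon.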
If this is right
- Spectral control from the polar map can be usefully combined with AdamW-style preconditioning on standard transformer architectures.
- Schur gauge-equivariance supplies additional benefit when the architecture has non-trivial multiplicity-basis freedom, as in SO(3)-equivariant models.
- Both polar-based methods continue to outperform plain AdamW on the tested tasks.
- The hybrid PolarAdamW runs at wall-clock cost comparable to Muon.
Where Pith is reading between the lines
- The hybridization technique could be applied to other existing optimizers to isolate the contribution of gauge symmetries.
- Designers of geometric or physics-informed networks may gain from choosing optimizers that match the symmetry group of the model.
- Future work could make the degree of gauge-equivariance in the optimizer adaptive to the current layer or training phase.
Load-bearing premise
The observed accuracy gaps between PolarAdamW and Muon are caused specifically by the presence or absence of Schur gauge-equivariance rather than by other algorithmic details or experimental factors.
What would settle it
Re-running the SO(3) point-cloud regression with PolarAdamW and finding that it matches or exceeds Muon at the audited capacities, or finding that the advantage of PolarAdamW over Muon disappears on the DeiT subsets under identical random seeds and tuning, would falsify the dissociation.
Original abstract
Muon's matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon's polar spectral-norm control but breaks the gauge-equivariance, since AdamW's coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon's Newton-Schulz polar map to AdamW's preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon's polar step is Schur gauge-equivariant on multiplicity matrices while AdamW's coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PolarAdamW as a hybrid optimizer that applies Muon's Newton-Schulz polar map to AdamW-preconditioned directions, thereby retaining spectral-norm control while breaking Schur gauge-equivariance. It asserts a proof that Muon's polar step is Schur gauge-equivariant on multiplicity matrices whereas AdamW's coordinatewise preconditioner is not. Experiments on DeiT-Tiny trained from scratch on four 100-class ImageNet-1k subsets show PolarAdamW outperforming Muon by +1.93 pp (and AdamW by +9.5 pp) on average; under a 300-epoch recipe the gaps are +1.37 pp and +5.80 pp. On SO(3)-equivariant 3D point-cloud regression the ordering reverses, with Muon ahead of PolarAdamW at every audited capacity (both still ahead of AdamW). The authors interpret this double dissociation as separating spectral control from gauge-equivariance.
Significance. If the claimed separation can be rigorously established, the work clarifies distinct mechanistic contributions within matrix optimizers and may guide selective incorporation of gauge-equivariance for tasks with nontrivial multiplicity freedom. The algorithmic construction at wall-time comparable to Muon is a practical strength, and the explicit contrast between trivial and non-trivial multiplicity regimes is a useful framing.
major comments (3)
- [Abstract and theoretical analysis] Abstract and theoretical section: the proof that Muon's polar step is Schur gauge-equivariant while AdamW's coordinatewise step is not is asserted without any derivation, lemmas, or intermediate equations, so the central mathematical claim that PolarAdamW cleanly isolates the two effects cannot be verified from the provided material.
- [Experimental evaluation] Experimental evaluation (DeiT-Tiny and SO(3) sections): the double dissociation attributes the performance reversal specifically to the presence/absence of Schur gauge-equivariance, yet the two tasks differ simultaneously in architecture (transformer vs. equivariant network), objective, data distribution, capacity scaling, and training length; no within-task ablation varies only the multiplicity-basis freedom, so the observed ordering cannot be unambiguously linked to gauge-equivariance rather than other interactions.
- [Experimental evaluation] Reported accuracy gaps (e.g., +1.93 pp and +1.37 pp averages): no error bars, standard deviations across the four subsets, or statistical significance tests are supplied, weakening the quantitative support for the claimed superiority and reversal.
minor comments (1)
- [Algorithmic definition] The algorithmic description of PolarAdamW would be clearer with explicit pseudocode or a step-by-step listing of the update rule, especially the precise point at which the polar map is applied to the preconditioned direction.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and describe the changes we will make in the revised version.
Point-by-point responses
- Referee: [Abstract and theoretical analysis] Abstract and theoretical section: the proof that Muon's polar step is Schur gauge-equivariant while AdamW's coordinatewise step is not is asserted without any derivation, lemmas, or intermediate equations, so the central mathematical claim that PolarAdamW cleanly isolates the two effects cannot be verified from the provided material.
  Authors: We agree that the proof requires more detailed exposition to allow verification. In the revised manuscript, we will include a complete derivation of the Schur gauge-equivariance property for the polar step, supported by appropriate lemmas and intermediate equations. This will explicitly show why Muon's polar map is equivariant on multiplicity matrices and why AdamW's preconditioner breaks this property, thereby clarifying how PolarAdamW isolates the spectral control effect. revision: yes
- Referee: [Experimental evaluation] Experimental evaluation (DeiT-Tiny and SO(3) sections): the double dissociation attributes the performance reversal specifically to the presence/absence of Schur gauge-equivariance, yet the two tasks differ simultaneously in architecture (transformer vs. equivariant network), objective, data distribution, capacity scaling, and training length; no within-task ablation varies only the multiplicity-basis freedom, so the observed ordering cannot be unambiguously linked to gauge-equivariance rather than other interactions.
  Authors: We recognize that the experimental setups differ in multiple dimensions, which limits the ability to isolate gauge-equivariance as the sole causal factor. Our intention was to demonstrate the effect in regimes where multiplicity-basis freedom is trivial versus non-trivial by construction. We will revise the manuscript to include an expanded discussion of these design choices and their implications for interpreting the double dissociation. While a controlled within-task ablation would be desirable, it would require fundamentally different experimental setups that may not preserve the original task characteristics; we therefore acknowledge this as a limitation in the revised text. revision: partial
- Referee: [Experimental evaluation] Reported accuracy gaps (e.g., +1.93 pp and +1.37 pp averages): no error bars, standard deviations across the four subsets, or statistical significance tests are supplied, weakening the quantitative support for the claimed superiority and reversal.
  Authors: We concur that the absence of variability measures and statistical tests weakens the quantitative claims. In the revision, we will add standard deviations across the four subsets for the DeiT-Tiny results, include error bars in the relevant figures, and report statistical significance tests (such as t-tests) for the performance differences. Similar updates will be made for the SO(3) experiments where multiple runs are available. revision: yes
Circularity Check
No significant circularity: direct algorithmic hybrid and independent mathematical proof
Full rationale
The paper's central construction defines PolarAdamW procedurally as applying the Newton-Schulz polar map (from Muon) to AdamW's preconditioned direction. This is an explicit combination of two existing algorithms with no data-dependent fitting, no parameter estimation from target quantities, and no reduction of the claimed separation to prior fitted results. The gauge-equivariance proof is presented as a standalone mathematical argument on multiplicity matrices. Empirical comparisons are reported on distinct tasks without any 'prediction' step that reuses fitted inputs. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing justifications. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Newton-Schulz iteration produces the polar factor of a matrix
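This axiom is easy to sanity-check numerically: for M = Q1 D Q2^T with positive diagonal D, the polar factor is Q1 Q2^T, and the cubic Newton-Schulz iteration recovers it once the start is scaled so every singular value lies in (0, √3). A minimal check, with the plain cubic iteration standing in for Muon's tuned variant:

```python
import numpy as np

def newton_schulz_polar(M, steps=30):
    # Cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X converges to the polar
    # factor of M when the scaled start has singular values in (0, sqrt(3)).
    X = M / np.linalg.norm(M, ord=2)  # scale spectral norm to 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = Q1 @ np.diag([2.0, 1.0, 0.5, 0.1]) @ Q2.T  # known polar factor: Q1 @ Q2.T

P = newton_schulz_polar(M)
print(np.allclose(P, Q1 @ Q2.T, atol=1e-6))  # True: iteration recovers the polar factor
```

The iteration drives each singular value toward 1 while leaving the singular subspaces fixed, which is exactly why the resulting update has unit spectral norm.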