When Muon Optimizer Meets Adversarial Training: A Theoretical and Empirical Study

Chengliang Wu; Jiankai Zuo; Jun Yan; Weiquan Huang; Xi Fang; Yujian Mo; Zeming Wei

arxiv: 2605.26929 · v2 · pith:EZSQCAYVnew · submitted 2026-05-26 · 💻 cs.LG

Jun Yan, Weiquan Huang, Jiankai Zuo, Yujian Mo, Xi Fang, Chengliang Wu, Zeming Wei

Pith reviewed 2026-06-29 19:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizer · adversarial training · spectral norm stability · orthogonal updates · robust optimization · CNN · vision transformer

The pith

Muon imposes a spectral-norm stability ceiling on matrix updates during adversarial training, limiting uncontrolled spectral growth without shrinking weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the Muon optimizer, which orthogonalizes updates through an approximate polar decomposition, imposes a spectral-norm bound that stabilizes training dynamics in adversarial settings. This offers a geometric account of why adaptive optimizers can yield inferior robustness while SGD remains the default: what matters is the geometry of the update, not only the learning rate. If the claim holds, optimizer choice directly affects the security properties of models trained against ℓ∞, ℓ1, and ℓ2 attacks. A reader would care because it turns a practical training detail into a lever for better defenses at a memory cost comparable to SGD.

Core claim

Muon imposes a spectral-norm stability ceiling on matrix updates, limiting uncontrolled spectral growth in the training dynamics without explicitly shrinking the learned weights. Empirically, across five architectures and three lp threat models and their union, Muon is competitive with SGD on CNNs and substantially outperforms AdamW on both CNNs and ViTs.

What carries the argument

Approximate polar decomposition to orthogonalize matrix-valued updates, which imposes the spectral-norm stability ceiling.
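The orthogonalization step can be illustrated with a minimal sketch. It uses the classical cubic Newton–Schulz iteration in pure Python, which is an assumption made here for readability: Muon's public implementation is described as using a tuned higher-order polynomial, and the function names and the example matrix `G` below are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def newton_schulz_polar(G, steps=12):
    """Approximate the orthogonal polar factor of a nonzero matrix G."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        # X <- 1.5 * X - 0.5 * (X X^T) X pushes each singular value toward 1.
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

G = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]  # hypothetical 2x3 gradient block
O = newton_schulz_polar(G)
gram = matmul(O, transpose(O))          # close to the 2x2 identity when O is semi-orthogonal
```

When `gram` is close to the identity, every singular value of `O` is close to 1, which is the property the spectral-norm argument relies on. With fewer steps the deviation from 1 grows, which is the gap the referee report below asks the authors to quantify.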

If this is right

Muon achieves robustness competitive with SGD on CNNs under multiple threat models.
Muon substantially outperforms AdamW on CNNs and vision transformers in adversarial training.
Optimizer geometry acts as a security-relevant factor affecting robustness.
Orthogonalized updates prove beneficial in regimes where adaptive methods fall short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar orthogonalization techniques could be applied to other optimizers to test if they gain similar stability in AT.
The approach might scale to larger models where spectral growth is more problematic.
Future work could explore whether this stability holds for other min-max optimization problems beyond AT.

Load-bearing premise

The approximate polar decomposition in Muon provides a reliable spectral-norm bound that holds throughout the adversarial training process on practical models and attacks.

What would settle it

An experiment showing that Muon allows spectral norms to grow uncontrollably during adversarial training, or that it fails to match SGD robustness on standard benchmarks.

Figures

Figures reproduced from arXiv: 2605.26929 by Chengliang Wu, Jiankai Zuo, Jun Yan, Weiquan Huang, Xi Fang, Yujian Mo, Zeming Wei.

**Figure 2.** Per-layer condition number change during training. [PITH_FULL_IMAGE:figures/full_fig_p019_2.png]

**Figure 3.** Spectral conditioning and robustness. These diagnostics show that SGD often achieves the smallest absolute spectral norm, consistent with its strong robustness under AT. Muon does not uniformly dominate SGD in terms of raw spectral-norm magnitude. Instead, its spectral behavior is consistently closer to SGD than to AdamW, and it avoids the large spectral growth observed for AdamW in several settings. There…

**Figure 4.** The spectral norm change with different optimizers during AT. [PITH_FULL_IMAGE:figures/full_fig_p020_4.png]

**Figure 5.** Training loss on PreActResNet-18 [24]. Panels recoverable from the extracted plot text: (a) ℓ∞ + ℓ1, (b) ℓ∞, and an ℓ1-norm panel, each plotting training loss against epoch (0 to 100) for the SGD, AdamW, and Muon optimizers…

**Figure 6.** Training loss on WRN-34-10 [51]. Under controlled conditions (identical learning rate of 0.01 for all optimizers on PreActResNet-18), [PITH_FULL_IMAGE:figures/full_fig_p020_6.png]

**Figure 7.** Training loss on WRN-34-20 [51]. Panels recoverable from the extracted plot text: (a) ℓ∞ + ℓ1, (b) ℓ∞, (c) ℓ1, each plotting training loss against epoch (0 to 100) for the SGD, AdamW, and Muon optimizers…

**Figure 8.** Training loss on ViT-B [18]. Panels recoverable from the extracted plot text: (a) ℓ∞ + ℓ1, (b) ℓ∞, (c) ℓ1, each plotting training loss against epoch (0 to 100) for the SGD, AdamW, and Muon optimizers…

**Figure 9.** Training loss on ViT-L [18]. Plot text extracted alongside this caption: (a) Nuclear Norm of Gradient, $\|G\|_*$ (mean over layers), and (b) Frobenius Norm of Gradient, $\|G\|_F$ (mean over layers), each plotted against epoch (0 to 80) for SGD, AdamW, and Muon. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png]

**Figure 10.** The variation tendency of nuclear gradient descent norms. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png]

**Figure 11.** Robustness evaluation during AT process. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png]

Original abstract

Adversarial training (AT) remains one of the most reliable empirical defenses against adversarial attacks. Its robustness critically depends on how the underlying min-max objective is optimized. In practice, Stochastic Gradient Descent (SGD) optimizer remains the default optimization choice for AT, whereas adaptive optimizers often improve standard training but may yield inferior robustness. Recently, the Muon optimizer, which orthogonalizes matrix-valued updates via an approximate polar decomposition, has achieved notable success in large-scale training at a memory cost comparable to SGD. This raises a security-relevant question: \textit{can orthogonalized optimization improve AT under strong and heterogeneous threat models?} Focusing on this problem, we conduct a comprehensive theoretical and empirical study. Theoretically, we show that Muon imposes a spectral-norm stability ceiling on matrix updates, limiting uncontrolled spectral growth in the training dynamics without explicitly shrinking the learned weights. Empirically, across five architectures and three $\ell_p$ threat models ($\ell_\infty$, $\ell_1$, $\ell_2$) and their union, Muon is competitive with SGD on CNNs and substantially outperforms AdamW on both CNNs and ViTs. These results identify optimizer geometry as a security-relevant factor in adversarial training, while clarifying the empirical regimes in which orthogonalized updates are beneficial. Overall, our findings highlight optimizer design as a security-critical component of AT.
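For reference, the min-max objective mentioned in the abstract is usually written as below. This is the standard formulation from the adversarial training literature (for example Madry et al. [36]), not notation taken from the paper itself; the symbols $f_\theta$, $\mathcal{L}$, $\mathcal{D}$, and $\epsilon_p$ are assumptions introduced here.

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[ \max_{\|\delta\|_p \le \epsilon_p} \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \Big],
\qquad p \in \{\infty, 1, 2\}
```

The outer minimization over $\theta$ is where the choice of SGD, AdamW, or Muon enters. For the union threat model, one common choice is to take the inner maximum over the union of the three $\ell_p$ balls; whether the paper uses exactly this form is not stated in the abstract.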

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Muon matches SGD and beats AdamW in some AT settings, but the spectral stability claim rests on an unanalyzed approximation whose error under adversarial gradients is not bounded.

The paper applies the Muon optimizer to adversarial training and reports that it stays competitive with SGD on CNNs while beating AdamW across CNNs and ViTs under l-infinity, l1, l2, and union threat models. The new angle is the claim that Muon's approximate polar decomposition imposes a spectral-norm ceiling on updates, which limits uncontrolled growth without explicit weight shrinkage.

The empirical section is the stronger part. Five architectures and multiple threat models give a reasonable map of where the optimizer helps, and the results are presented without obvious hyperparameter cherry-picking in the abstract. That alone makes the work worth checking for anyone who tunes robust training pipelines.

The theory is thinner. The stability ceiling follows from the fact that an exact orthogonal factor has spectral norm 1, yet Muon uses a finite number of iterations to approximate the polar factor. No derivation bounds the deviation from norm 1 as a function of iteration count or gradient size, and adversarial training produces larger, more variable gradients than standard training. The experiments measure final robustness rather than tracking spectral norms of the updates during training, so they do not test whether the claimed ceiling actually holds. This gap is load-bearing for the theoretical contribution.
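The exact-case argument referred to here can be written out in a few lines. This is a sketch under stated assumptions (a plain update $W_{t+1} = W_t - \eta_t O_t$ with no weight decay, momentum, or shape-dependent scaling), not the paper's own theorem; the symbols $O_t$, $\tilde{O}_t$, and $\varepsilon$ are introduced here for illustration.

```latex
% Exact polar factor: G_t = U \Sigma V^\top (nonzero), O_t = U V^\top, so \|O_t\|_2 = 1.
\|W_{t+1}\|_2 = \|W_t - \eta_t O_t\|_2 \le \|W_t\|_2 + \eta_t \|O_t\|_2 = \|W_t\|_2 + \eta_t
\quad\Longrightarrow\quad
\|W_T\|_2 \le \|W_0\|_2 + \sum_{t=0}^{T-1} \eta_t .

% Approximate factor with \|\tilde{O}_t\|_2 \le 1 + \varepsilon:
\|W_T\|_2 \le \|W_0\|_2 + (1 + \varepsilon) \sum_{t=0}^{T-1} \eta_t .
```

The second line shows why the size of $\varepsilon$ matters: the ceiling on spectral growth is only as tight as the bound on the approximate factor, and no such bound is derived in the text summarized here.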

The work is aimed at practitioners choosing optimizers for robustness and at researchers interested in optimizer geometry. The empirical comparisons are concrete enough to justify referee time even if the theory needs tightening. I would send it to review rather than desk reject.

Referee Report

2 major / 0 minor

Summary. The paper claims that the Muon optimizer, which orthogonalizes matrix-valued updates via approximate polar decomposition, imposes a spectral-norm stability ceiling on updates during adversarial training (AT). This limits uncontrolled spectral growth without explicitly shrinking weights. Empirically, across five architectures and three ℓ_p threat models (ℓ_∞, ℓ_1, ℓ_2) plus their union, Muon is competitive with SGD on CNNs and substantially outperforms AdamW on both CNNs and ViTs, identifying optimizer geometry as a security-relevant factor in AT.

Significance. If the theoretical stability claim holds under the approximation used by Muon, the work would usefully connect optimizer geometry to robustness in min-max optimization and provide practical guidance on when orthogonalized updates are beneficial. The broad empirical scope across architectures and heterogeneous threat models is a strength that would make the findings actionable for AT practitioners.

major comments (2)

[Theoretical analysis] Theoretical section on spectral-norm ceiling: The claim that Muon imposes a spectral-norm stability ceiling rests on the exact polar decomposition property that the orthogonal factor satisfies ||U||_2 = 1. Muon replaces this with a finite number of Newton-style iterations, yet no bound is derived on ||U_approx||_2 − 1 as a function of iteration count, gradient magnitude, or the size of the adversarial perturbation. In AT the effective gradients are larger and more variable than in standard training, so the approximation error could permit spectral growth beyond the claimed ceiling.
[Empirical evaluation] Empirical results section: The reported robustness numbers compare final test accuracy under attack but do not include direct measurements (e.g., plots or tables) of the spectral norms of the matrix updates throughout training. Without such diagnostics it is impossible to confirm that the claimed stability ceiling is realized in practice under the studied threat models and architectures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our theoretical and empirical claims. We address each major point below and indicate the planned revisions.

Referee: [Theoretical analysis] Theoretical section on spectral-norm ceiling: The claim that Muon imposes a spectral-norm stability ceiling rests on the exact polar decomposition property that the orthogonal factor satisfies ||U||_2 = 1. Muon replaces this with a finite number of Newton-style iterations, yet no bound is derived on ||U_approx||_2 − 1 as a function of iteration count, gradient magnitude, or the size of the adversarial perturbation. In AT the effective gradients are larger and more variable than in standard training, so the approximation error could permit spectral growth beyond the claimed ceiling.

Authors: We agree that the theoretical analysis in the manuscript relies on the exact polar decomposition property (||U||_2 = 1) while Muon employs a finite Newton iteration approximation. No explicit error bound is derived as a function of iteration count or gradient statistics. We will add a new subsection discussing the quadratic convergence of the Newton iterations and the practical tolerance used in Muon (typically achieving ||U_approx||_2 within 10^{-4} of 1 after 5–6 steps). We will also note that the stability ceiling holds exactly only in the limit of the approximation and that larger adversarial gradients may increase the required iteration count; this limitation will be stated explicitly. revision: partial
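The quadratic convergence mentioned in this response can be checked on a single singular value. The derivation below is for the classical cubic Newton–Schulz map, which is an assumption made here for illustration; Muon's public implementation is described as using a tuned higher-order polynomial, for which this recursion need not hold.

```latex
% Each singular value \sigma_k of the iterate evolves independently:
\sigma_{k+1} = \tfrac{3}{2}\sigma_k - \tfrac{1}{2}\sigma_k^{3}.

% Substituting \sigma_k = 1 - e_k and expanding:
\sigma_{k+1} = 1 - \tfrac{3}{2}e_k^{2} + \tfrac{1}{2}e_k^{3}
\quad\Longrightarrow\quad
e_{k+1} = \tfrac{3}{2}e_k^{2} - \tfrac{1}{2}e_k^{3}.
```

The error is roughly squared at every step once $e_k$ is small, but singular values that start near zero need several steps before they enter that regime, so the step count required for a given tolerance depends on the conditioning of the gradient.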
Referee: [Empirical evaluation] Empirical results section: The reported robustness numbers compare final test accuracy under attack but do not include direct measurements (e.g., plots or tables) of the spectral norms of the matrix updates throughout training. Without such diagnostics it is impossible to confirm that the claimed stability ceiling is realized in practice under the studied threat models and architectures.

Authors: We concur that direct monitoring of spectral norms during training would strengthen the link between the theoretical ceiling and observed robustness. In the revised manuscript we will add a new figure (and corresponding table in the appendix) that tracks the spectral norm of the orthogonalized updates for Muon versus the raw gradient updates for SGD and AdamW across representative layers and threat models (ℓ_∞ and ℓ_2). These diagnostics will be computed on the same training runs reported in the main experiments. revision: yes
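The kind of diagnostic promised in this response could be computed along the following lines. This is a minimal pure-Python sketch that estimates the spectral norm by power iteration; the function names and the example `updates` list are hypothetical, and a real training run would more likely use a library SVD or matrix-norm routine on the framework's tensors.

```python
import math

def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W (list of rows) by power iteration."""
    rows, cols = len(W), len(W[0])
    # Deterministic start vector; a random start is the usual choice in practice,
    # since a start orthogonal to the top singular vector would stall.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        z = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # z = W^T u
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]  # assumes W is not the zero matrix
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

def track_update_norms(updates):
    """Return the spectral norm of each update matrix, in training order."""
    return [spectral_norm(U) for U in updates]

# Hypothetical per-step update matrices for one layer.
updates = [
    [[3.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [0.0, 0.0]],
]
norms = track_update_norms(updates)
```

Logging `norms` per layer and per step for each optimizer would show directly whether Muon's orthogonalized updates stay near a fixed spectral norm while the raw SGD and AdamW updates vary.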

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central theoretical claim—that Muon imposes a spectral-norm stability ceiling via polar decomposition—is presented as a direct mathematical consequence of the optimizer's update rule rather than a fitted quantity or self-referential definition. No equations, self-citations, or ansatzes in the provided text reduce the result to its own inputs by construction. The derivation chain remains independent of the empirical results and does not rely on load-bearing self-citations or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; ledger is therefore empty pending full text.

pith-pipeline@v0.9.1-grok · 5790 in / 1005 out tokens · 29719 ms · 2026-06-29T19:38:48.249477+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Feature purification: How adversarial training performs robust deep learning

Zeyuan Allen-Zhu and Yuanzhi Li. Feature purification: How adversarial training performs robust deep learning. In IEEE 62nd annual symposium on foundations of computer science (FOCS), pages 977–988, 2022

2022
[2]

Understanding and improving fast adversarial training

Maksym Andriushchenko and Nicolas Flammarion. Understanding and improving fast adversarial training. Advances in Neural Information Processing Systems, 33:16048–16059, 2020

2020
[3]

Towards understanding sharpness-aware minimization

Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. In International conference on machine learning, pages 639–668, 2022

2022
[4]

Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274–283, 2018

2018
[5]

Synthesizing robust adversarial examples

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International conference on machine learning, pages 284–293, 2018

2018
[6]

The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima

Peter L Bartlett, Philip M Long, and Olivier Bousquet. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Journal of Machine Learning Research, 24(316):1–36, 2023

2023
[7]

Modular duality in deep learning

Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. In ICML, 2025

2025
[8]

Stochastic gradient descent tricks

Léon Bottou. Stochastic gradient descent tricks. In Neural networks: tricks of the trade: second edition, pages 421–436. Springer, 2012

2012
[9]

Towards evaluating the robustness of neural networks

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), pages 39–57, 2017

2017
[10]

Muon optimizes under spectral norm constraints

Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

work page arXiv 2025
[11]

Certified adversarial robustness via randomized smoothing

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International conference on machine learning, pages 1310–1320, 2019

2019
[12]

Robustbench: a standardized adversarial robustness benchmark

Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. In Conference on Neural Information Processing Systems, 2021

2021
[13]

Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206–2216, 2020

2020
[14]

Adversarial robustness against multiple and single l_p-threat models via quick fine-tuning of robust classifiers

Francesco Croce and Matthias Hein. Adversarial robustness against multiple and single l_p-threat models via quick fine-tuning of robust classifiers. In International Conference on Machine Learning, pages 4436–4454, 2022

2022
[15]

Revisiting outer optimization in adversarial training

Ali Dabouei, Fariborz Taherkhani, Sobhan Soleymani, and Nasser M Nasrabadi. Revisiting outer optimization in adversarial training. In European Conference on Computer Vision, pages 244–261, 2022

2022
[16]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009

2009
[17]

Towards understanding the dynamics of the first-order adversaries

Zhun Deng, Hangfeng He, Jiaoyang Huang, and Weijie Su. Towards understanding the dynamics of the first-order adversaries. In International Conference on Machine Learning, pages 2484–2493, 2020

2020
[18]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, pages 1–13, 2021

2021
[19]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv
[20]

Sharpness-aware minimization for efficiently improving generalization

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021

2021
[21]

Convergence of adversarial training in overparametrized neural networks

Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32, 2019

2019
[22]

Explaining and harnessing adversarial examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015

2015
[23]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[24]

Identity mappings in deep residual networks

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016

2016
[25]

Formal guarantees on the robustness of a classifier against adversarial manipulation

Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in neural information processing systems, 30, 2017

2017
[26]

Functions of matrices: theory and computation

Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008

2008
[27]

Improving dnn robustness to adversarial attacks using jacobian regularization

Daniel Jakubovitz and Raja Giryes. Improving dnn robustness to adversarial attacks using jacobian regularization. In Proceedings of the European conference on computer vision (ECCV), pages 514–529, 2018

2018
[28]

Muon: An optimizer for hidden layers in neural networks

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

2024
[29]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

2015
[30]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. http://www.cs.toronto.edu/~kriz/cifar.html, 2009

2009
[31]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms

Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms. arXiv preprint arXiv:2502.17410, 2025

work page arXiv 2025
[33]

Bad global minima exist and sgd can reach them

Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and sgd can reach them. Advances in Neural Information Processing Systems, 33:8543–8552, 2020

2020
[34]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2017

2017
[35]

Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Avery Ma, Yangchen Pan, and Amir-massoud Farahmand. Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods. Transactions on Machine Learning Research, 2023

2023
[36]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018

2018
[37]

Training transformers with enforced lipschitz constants

Laker Newhouse, R Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. Training transformers with enforced lipschitz constants. arXiv preprint arXiv:2507.13338, 2025

work page arXiv 2025
[38]

Rethinking softmax cross-entropy loss for adversarial robustness

Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, and Jun Zhu. Rethinking softmax cross-entropy loss for adversarial robustness. In ICLR, 2020

2020
[39]

Bag of tricks for adversarial training

Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training. In ICLR, 2021

2021
[40]

A stochastic approximation method

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951

1951
[41]

Adversarial training for free!

Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in neural information processing systems, 32, 2019

2019
[42]

On the Convergence Analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

2016
[44]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014

2014
[45]

On adaptive attacks to adversarial example defenses

Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems, 33:1633–1645, 2020

2020
[46]

Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks

Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. Advances in neural information processing systems, 31, 2018

2018
[47]

On the convergence and robustness of adversarial training

Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019

2019
[48]

Fast is better than free: Revisiting adversarial training

Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. In ICLR, 2020

2020
[49]

Adversarial weight perturbation helps robust generalization

Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in neural information processing systems, 33:2958–2969, 2020

2020
[50]

Feature squeezing: Detecting adversarial examples in deep neural networks

Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In Proceedings 2018 Network and Distributed System Security Symposium, 2018

2018
[51]

Wide residual networks

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, pages 1–15, 2016

2016
[52]

Theoretically principled trade-off between robustness and accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482, 2019

2019
[53]

On the duality between sharpness-aware minimization and adversarial training

Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, and Zeming Wei. On the duality between sharpness-aware minimization and adversarial training. In International Conference on Machine Learning, pages 59024–59041, 2024

2024
[54]

Sharpness-aware minimization efficiently selects flatter minima late in training

Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, and Junchi Yan. Sharpness-aware minimization efficiently selects flatter minima late in training. In ICLR, 2025

Algorithm 2: One Muon Step under AT (expands line 10 of Alg. 1 when Opt = Muon; Newton–Schulz iteration inlined as sub-steps of S2). Require: weight block $W \in \mathbb{R}^{m \times n}$, momentum buffer $M$ (init. 0)...

2025
[55]

Jacobian scale

Hence $\nabla_x m_{y,j}(x) = J_{g_\theta}(x)^\top (e_y - e_j)$ (21), and therefore it satisfies $\|\nabla_x m_{y,j}(x)\|_2 \le \sqrt{2}\,\|J_{g_\theta}(x)\|_2 \le \sqrt{2} \prod_{\ell=1}^{L} \|W_\ell\|_2$ (22), where the last inequality follows from Proposition 2.2 assuming 1-Lipschitz activations. Otherwise, the product is replaced by $L_\phi^{L-1} \prod_{\ell} \|W_\ell\|_2$. Since the multiclass margin can be written as $m_\theta(x, y) = g_\theta(x)_y - \max_{j \ne y} g_\theta(x)_j = \min_{j \ne y}$...

work page arXiv 1943

[1] [1]

Feature purification: How adversarial training performs robust deep learning

Zeyuan Allen-Zhu and Yuanzhi Li. Feature purification: How adversarial training performs robust deep learning. InIEEE 62nd annual symposium on foundations of computer science (FOCS), pages 977–988, 2022

2022

[2] [2]

Understanding and improving fast ad- versarial training.Advances in Neural Information Processing Systems, 33:16048–16059, 2020

Maksym Andriushchenko and Nicolas Flammarion. Understanding and improving fast ad- versarial training.Advances in Neural Information Processing Systems, 33:16048–16059, 2020

2020

[3] [3]

Towards understanding sharpness-aware minimization

Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. InInternational conference on machine learning, pages 639–668, 2022

2022

[4] [4]

Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. InInternational conference on machine learning, pages 274–283, 2018

2018

[5]

Synthesizing robust adversarial examples

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International conference on machine learning, pages 284–293, 2018

2018

[6]

The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima

Peter L Bartlett, Philip M Long, and Olivier Bousquet. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Journal of Machine Learning Research, 24(316):1–36, 2023

2023

[7]

Modular duality in deep learning

Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. In ICML, 2025

2025

[8]

Stochastic gradient descent tricks

Léon Bottou. Stochastic gradient descent tricks. In Neural networks: tricks of the trade: second edition, pages 421–436. Springer, 2012

2012

[9]

Towards evaluating the robustness of neural networks

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), pages 39–57, 2017

2017

[10]

Muon optimizes under spectral norm constraints

Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

2025

[11]

Certified adversarial robustness via randomized smoothing

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International conference on machine learning, pages 1310–1320, 2019

2019

[12]

Robustbench: a standardized adversarial robustness benchmark

Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. In Conference on Neural Information Processing Systems, 2021

2021

[13]

Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206–2216, 2020

2020

[14]

Adversarial robustness against multiple and single l_p-threat models via quick fine-tuning of robust classifiers

Francesco Croce and Matthias Hein. Adversarial robustness against multiple and single l_p-threat models via quick fine-tuning of robust classifiers. In International Conference on Machine Learning, pages 4436–4454, 2022

2022

[15]

Revisiting outer optimization in adversarial training

Ali Dabouei, Fariborz Taherkhani, Sobhan Soleymani, and Nasser M Nasrabadi. Revisiting outer optimization in adversarial training. In European Conference on Computer Vision, pages 244–261, 2022

2022

[16]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009

2009

[17]

Towards understanding the dynamics of the first-order adversaries

Zhun Deng, Hangfeng He, Jiaoyang Huang, and Weijie Su. Towards understanding the dynamics of the first-order adversaries. In International Conference on Machine Learning, pages 2484–2493, 2020

2020

[18]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, pages 1–13, 2021

2021

[19]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv

[20]

Sharpness-aware minimization for efficiently improving generalization

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021

2021

[21]

Convergence of adversarial training in overparametrized neural networks

Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32, 2019

2019

[22]

Explaining and harnessing adversarial examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015

2015

[23]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016

[24]

Identity mappings in deep residual networks

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016

2016

[25]

Formal guarantees on the robustness of a classifier against adversarial manipulation

Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in neural information processing systems, 30, 2017

2017

[26]

Functions of matrices: theory and computation

Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008

2008

[27]

Improving dnn robustness to adversarial attacks using jacobian regularization

Daniel Jakubovitz and Raja Giryes. Improving dnn robustness to adversarial attacks using jacobian regularization. In Proceedings of the European conference on computer vision (ECCV), pages 514–529, 2018

2018

[28]

Muon: An optimizer for hidden layers in neural networks

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

2024

[29]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

2015

[30]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. http://www.cs.toronto.edu/~kriz/cifar.html, 2009

2009

[31]

Muon is scalable for LLM training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

2025

[32]

Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms

Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms. arXiv preprint arXiv:2502.17410, 2025

2025

[33]

Bad global minima exist and sgd can reach them

Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and sgd can reach them. Advances in Neural Information Processing Systems, 33:8543–8552, 2020

2020

[34]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2017

2017

[35]

Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Avery Ma, Yangchen Pan, and Amir-massoud Farahmand. Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods. Transactions on Machine Learning Research, 2023

2023

[36]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018

2018

[37]

Training transformers with enforced lipschitz constants

Laker Newhouse, R Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. Training transformers with enforced lipschitz constants. arXiv preprint arXiv:2507.13338, 2025

2025

[38]

Rethinking softmax cross-entropy loss for adversarial robustness

Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, and Jun Zhu. Rethinking softmax cross-entropy loss for adversarial robustness. In ICLR, 2020

2020

[39]

Bag of tricks for adversarial training

Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training. In ICLR, 2021

2021

[40]

A stochastic approximation method

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951

1951

[41]

Adversarial training for free!

Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in neural information processing systems, 32, 2019

2019

[42]

On the convergence analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

2025

[43]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

2016

[44]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014

2014

[45]

On adaptive attacks to adversarial example defenses

Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems, 33:1633–1645, 2020

2020

[46]

Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks

Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. Advances in neural information processing systems, 31, 2018

2018

[47]

On the convergence and robustness of adversarial training

Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019

2019

[48]

Fast is better than free: Revisiting adversarial training

Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. In ICLR, 2020

2020

[49]

Adversarial weight perturbation helps robust generalization

Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in neural information processing systems, 33:2958–2969, 2020

2020

[50]

Feature squeezing: Detecting adversarial examples in deep neural networks

Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In Proceedings 2018 Network and Distributed System Security Symposium, 2018

2018

[51]

Wide residual networks

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, pages 1–15, 2016

2016

[52]

Theoretically principled trade-off between robustness and accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482, 2019

2019

[53]

On the duality between sharpness-aware minimization and adversarial training

Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, and Zeming Wei. On the duality between sharpness-aware minimization and adversarial training. In International Conference on Machine Learning, pages 59024–59041, 2024

2024

[54]

Sharpness-aware minimization efficiently selects flatter minima late in training

Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, and Junchi Yan. Sharpness-aware minimization efficiently selects flatter minima late in training. In ICLR, 2025

2025

Algorithm 2: One Muon Step under AT (expands line 10 of Alg. 1 when Opt = Muon; Newton–Schulz iteration inlined as sub-steps of S2). Require: Weight block $W \in \mathbb{R}^{m \times n}$, momentum buffer $M$ (init. 0)...
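The Algorithm 2 fragment is truncated here, so the following is only a minimal sketch of what one Muon-style step on a weight block might look like, not the paper's exact procedure. It assumes heavy-ball momentum and substitutes the classical cubic Newton-Schulz iteration for whatever tuned polynomial the paper uses; the function names, `lr`, `beta`, and `steps` are all hypothetical choices for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the polar factor U V^T of G (assumption: classical cubic iteration)."""
    # Frobenius normalisation puts every singular value in (0, 1],
    # inside the convergence region of the cubic map s -> 1.5 s - 0.5 s^3.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def muon_step(W, M, grad, lr=0.02, beta=0.95):
    """One hypothetical Muon-style update on a weight block W with momentum buffer M."""
    M = beta * M + grad                  # accumulate momentum (assumed heavy-ball form)
    O = newton_schulz_orthogonalize(M)   # orthogonalised update direction
    return W - lr * O, M

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
M = np.zeros_like(W)                     # momentum buffer initialised to 0
grad = rng.standard_normal((6, 4))       # stand-in for an adversarial-loss gradient
W_new, M = muon_step(W, M, grad)
print(np.linalg.norm(W_new - W, 2))      # spectral norm of the applied update
```

Because the cubic map keeps singular values in [0, 1], the spectral norm of each applied update in this sketch is at most `lr`, which is one way to read the per-step stability ceiling described in the paper's summary.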

[55]

Jacobian scale

Hence
$$\nabla_x m_{y,j}(x) = J_{g_\theta}(x)^\top (e_y - e_j), \tag{21}$$
and therefore it satisfies
$$\|\nabla_x m_{y,j}(x)\|_2 \le \sqrt{2}\,\|J_{g_\theta}(x)\|_2 \le \sqrt{2} \prod_{\ell=1}^{L} \|W_\ell\|_2, \tag{22}$$
where the last inequality follows from Proposition 2.2 assuming 1-Lipschitz activations. Otherwise, the product is replaced by $L_\phi^{L-1} \prod_\ell \|W_\ell\|_2$. Since the multiclass margin can be written as $m_\theta(x, y) = g_\theta(x)_y - \max_{j \neq y} g_\theta(x)_j = \min_{j \neq y}$...
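The bound in (22) can be checked numerically on a toy network. The sketch below is an illustration under stated assumptions, not the paper's code: it uses a bias-free ReLU MLP (ReLU is 1-Lipschitz), random Gaussian weights, and a hypothetical helper `margin_gradient_and_bound` that forms the input Jacobian explicitly and compares the pairwise margin gradient norm with $\sqrt{2}\prod_\ell \|W_\ell\|_2$.

```python
import numpy as np

def margin_gradient_and_bound(weights, x, y, j):
    """Return (||grad_x m_{y,j}(x)||_2, sqrt(2) * prod_l ||W_l||_2) for a bias-free ReLU MLP."""
    h = x
    jac = np.eye(x.shape[0])              # running Jacobian of the current layer w.r.t. x
    for idx, W in enumerate(weights):
        z = W @ h
        jac = W @ jac
        if idx < len(weights) - 1:        # ReLU on hidden layers only
            mask = (z > 0).astype(z.dtype)
            h = z * mask
            jac = mask[:, None] * jac     # diagonal ReLU derivative, spectral norm <= 1
        else:
            h = z
    direction = np.zeros(h.shape[0])
    direction[y], direction[j] = 1.0, -1.0   # e_y - e_j, which has Euclidean norm sqrt(2)
    grad = jac.T @ direction                 # Eq. (21)
    bound = np.sqrt(2.0) * np.prod([np.linalg.norm(W, 2) for W in weights])  # Eq. (22)
    return np.linalg.norm(grad), bound

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)),
           rng.standard_normal((16, 16)),
           rng.standard_normal((5, 16))]
x = rng.standard_normal(8)
grad_norm, bound = margin_gradient_and_bound(weights, x, y=0, j=3)
print(grad_norm, bound, grad_norm <= bound)
```

In practice the product of spectral norms is usually a loose upper bound, so the gap between the two printed values is expected to be large; the point of the check is only that the inequality holds.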