pith. machine review for the scientific record. sign in

arxiv: 2605.09991 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.LG· math.OC

Recognition: 2 theorem links

· Lean Theorem

Optimizer-Induced Mode Connectivity: From AdamW to Muon

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3

classification 💻 cs.AI cs.LGmath.OC
keywords mode connectivityimplicit regularizationloss landscapeoptimizersReLU networksAdamWMuontransformers
0
0 comments X

The pith

Solutions from one optimizer form a connected set at large width in two-layer ReLU networks due to implicit regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how optimizer choice shapes connectivity among low-loss solutions. For two-layer ReLU networks it proves that all solutions reached by any single optimizer, such as AdamW or Muon, lie on one connected component once the width is large enough. Different optimizers can produce either overlapping or disjoint regions depending on regularization strength, and small-width examples exhibit a provable loss barrier between AdamW and Muon zero-loss solutions. In GPT-2 pretraining, paths staying within one optimizer preserve the model spectrum while paths that cross optimizers produce smooth transitions. These findings show that the solution space is partitioned by optimizer-specific structure rather than being uniformly connected.

Core claim

For two-layer ReLU networks, solutions from a single optimizer in the Lion-K family form a connected set at sufficiently large width. Optimizer-induced regions can be disjoint or overlap at large width depending on regularization, while at small width AdamW and Muon reach disconnected zero-loss components separated by a provable loss barrier. In GPT-2 pretraining, same-optimizer paths preserve each model's spectrum and cross-optimizer paths traverse a smooth transition.

What carries the argument

Optimizer-induced implicit regularization that restricts solutions to connected regions within each optimizer's reachable set.

Load-bearing premise

That the implicit regularization imposed by each optimizer is strong enough to force all its solutions into one connected component once the network width is large.

What would settle it

Identifying two AdamW solutions on a sufficiently wide two-layer ReLU network whose connecting paths all encounter a positive loss barrier.

Figures

Figures reproduced from arXiv: 2605.09991 by Erica Zhang, Fangzhao Zhang, Mert Pilanci, Sungyoon Kim, Yiqi Jiang.

Figure 1
Figure 1. Figure 1: Motivating Experiment. Singular value histograms for layer 1 up_proj weights in GPT-2 training. Models trained with AdamW and Muon converge to solutions with distinct spectrum. Solutions obtained by the same optimizer can be interpolated by paths that preserve the spectrum. Naive interpolation does not connect models with low loss paths. See Section 4.2 for experimental details. open whether the two endpoi… view at source ↗
Figure 2
Figure 2. Figure 2: Mode Connectivity Under Optimizer-Induced Constraints. Solid line denotes low-loss path, dashed line denotes path with a barrier. (a) Classical mode connectivity treats low-loss solutions as lying in a connected set. (b) Regularized solution sets induced by each optimizer are connected for sufficiently wide networks. The regions may or may not intersect, depending on the problem and regularization. (c) In … view at source ↗
Figure 3
Figure 3. Figure 3: Spectrum along same-optimizer connectivity paths. Each panel shows the singular value [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Spectrum along the AdamW→Muon connectivity path. Each panel shows the singular value histogram at a different interpolation coefficient. We next connect a model trained with AdamW to one trained with Muon. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Loss barrier along the AdamW→Muon cross-optimizer path, evaluated in-distribution on enwik8 (left) and out-of-distribution on Stories (right). spectral diversity along cross-optimizer paths for more effective model merging (Stoica et al., 2024; Crisostomi et al., 2024) or continual learning is a promising practical direction. Acknowledgments and Disclosure of Funding This work was supported in part by the … view at source ↗
Figure 6
Figure 6. Figure 6: Spectrum along the AdamW→Muon connectivity path. Each panel shows the singular value histogram at a different interpolation coefficient t. θC trained with Muon [PITH_FULL_IMAGE:figures/full_fig_p037_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Out-of-distribution experiments for additional datasets. [PITH_FULL_IMAGE:figures/full_fig_p037_7.png] view at source ↗
read the original abstract

Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that optimizer-induced implicit regularization structures the loss landscape such that, for two-layer ReLU networks, solutions obtained from any single optimizer (AdamW, Muon, or members of the Lion-K family) form a connected set at sufficiently large width. It further shows that regions induced by different optimizers may be disjoint or overlap depending on regularization strength, with a concrete small-width example in which AdamW and Muon zero-loss solutions are separated by a provable loss barrier. Empirically, linear paths between GPT-2 models trained with the same optimizer preserve spectral properties, while cross-optimizer paths exhibit smooth transitions.

Significance. If the central claims hold, the work supplies a new axis for mode-connectivity analysis by tying connectivity to the implicit bias of concrete optimizers rather than to the loss alone. The two-layer ReLU results are not implied by prior connectivity theorems, and the GPT-2 observations indicate that the phenomenon is observable in practical training. These contributions could inform both theoretical understanding of optimization dynamics and practical choices among optimizers.

major comments (2)
  1. [§3] §3 (two-layer ReLU connectivity): the argument that single-optimizer solution sets are connected at large width rests on the claim that implicit regularization sufficiently constrains the feasible set; the manuscript must exhibit the precise width-dependent argument and confirm that connectivity follows directly from the optimizer update rule without additional unstated assumptions.
  2. [Small-width example] Small-width example (interaction between AdamW and Muon regions): the claim of a provable loss barrier separating the two zero-loss components is load-bearing for the disconnection result; the explicit construction of the barrier and the argument ruling out any lower-loss path between the components must be supplied in full.
minor comments (2)
  1. [Abstract] The Lion-K family is referenced in the abstract without an immediate definition or citation; a short clarifying sentence in the introduction would improve readability.
  2. [Empirical GPT-2 section] GPT-2 experiments: the description of how spectra are extracted and compared along interpolation paths should include the precise metric, number of independent runs, and any statistical controls used to support the 'preserve' versus 'smooth transition' statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the positive evaluation and the recommendation for minor revision. We address the two major comments point by point below, agreeing to provide the requested clarifications and expansions in the revised manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (two-layer ReLU connectivity): the argument that single-optimizer solution sets are connected at large width rests on the claim that implicit regularization sufficiently constrains the feasible set; the manuscript must exhibit the precise width-dependent argument and confirm that connectivity follows directly from the optimizer update rule without additional unstated assumptions.

    Authors: We appreciate this feedback. The current manuscript sketches the connectivity via implicit regularization but does not fully detail the width dependence. In the revised version, we will include the precise argument showing that for widths exceeding a threshold determined by the network depth and regularization parameters, the optimizer-specific constraints define a connected component. Connectivity then follows from the convexity of the constrained set induced by the update rules, without additional assumptions. revision: yes

  2. Referee: [Small-width example] Small-width example (interaction between AdamW and Muon regions): the claim of a provable loss barrier separating the two zero-loss components is load-bearing for the disconnection result; the explicit construction of the barrier and the argument ruling out any lower-loss path between the components must be supplied in full.

    Authors: The referee is correct that the barrier claim requires a complete argument. While the example is given, the full proof ruling out lower-loss paths is only outlined. We will supply the explicit construction (including the specific small-width network weights for AdamW and Muon solutions) and the detailed argument in the revision, demonstrating that the loss must increase along any connecting path due to the mismatch in the effective regularizers. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central derivation establishes optimizer-induced mode connectivity for two-layer ReLU networks at large width via analysis of implicit regularization specific to AdamW, Muon, and Lion-K family optimizers. This is explicitly positioned as not implied by prior work, with additional characterization of inter-optimizer region interactions and empirical GPT-2 observations. No load-bearing step reduces by construction to a fitted input, self-citation chain, or renamed known result; the argument relies on new theoretical constraints and verifiable empirical paths rather than re-using quantities internal to the paper. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The work appears to rest on standard neural-network assumptions such as ReLU properties and width limits.

pith-pipeline@v0.9.0 · 5469 in / 1207 out tokens · 47257 ms · 2026-05-12T03:34:30.127827+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

168 extracted references · 168 canonical work pages · 2 internal anchors

  1. [1]

    2025 , eprint=

    Pre-training under infinite compute , author=. 2025 , eprint=

  2. [2]

    2017 , eprint=

    Snapshot Ensembles: Train 1, get M for free , author=. 2017 , eprint=

  3. [3]

    2015 , eprint=

    Deep Reinforcement Learning with Double Q-learning , author=. 2015 , eprint=

  4. [4]

    2018 , eprint=

    Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs , author=. 2018 , eprint=

  5. [5]

    2024 , eprint=

    C^2M^3 : Cycle-Consistent Multi-Model Merging , author=. 2024 , eprint=

  6. [6]

    2024 , eprint=

    ZipIt! Merging Models from Different Tasks without Training , author=. 2024 , eprint=

  7. [7]

    2024 , eprint=

    Merging Text Transformer Models from Different Initializations , author=. 2024 , eprint=

  8. [8]

    , author=

    Revisiting Mode Connectivity in Neural Networks with Bezier Surface. , author=. ICLR , year=

  9. [9]

    2024 , eprint=

    Approaching Deep Learning through the Spectral Dynamics of Weights , author=. 2024 , eprint=

  10. [10]

    2017 , eprint=

    Sharp Minima Can Generalize For Deep Nets , author=. 2017 , eprint=

  11. [11]

    2017 , eprint=

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , author=. 2017 , eprint=

  12. [12]

    Proceedings of the 1st NeurIPS Workshop on Symmetry and Geometry in Neural Representations , pages =

    Connectedness of loss landscapes via the lens of Morse theory , author =. Proceedings of the 1st NeurIPS Workshop on Symmetry and Geometry in Neural Representations , pages =. 2023 , editor =

  13. [13]

    2025 , eprint=

    Revisiting the Initial Steps in Adaptive Gradient Descent Optimization , author=. 2025 , eprint=

  14. [14]

    2025 , eprint=

    How does the optimizer implicitly bias the model merging loss landscape? , author=. 2025 , eprint=

  15. [15]

    2024 , eprint=

    Do Deep Neural Network Solutions Form a Star Domain? , author=. 2024 , eprint=

  16. [16]

    2023 , eprint=

    Disentangling Linear Mode-Connectivity , author=. 2023 , eprint=

  17. [17]

    2024 , eprint=

    Transformer Fusion with Optimal Transport , author=. 2024 , eprint=

  18. [18]

    2022 , eprint=

    Re-basin via implicit Sinkhorn differentiation , author=. 2022 , eprint=

  19. [19]

    2025 , eprint=

    Analysis of Linear Mode Connectivity via Permutation-Based Weight Matching: With Insights into Other Permutation Search Methods , author=. 2025 , eprint=

  20. [20]

    2023 , eprint=

    REPAIR: REnormalizing Permuted Activations for Interpolation Repair , author=. 2023 , eprint=

  21. [21]

    2025 , eprint=

    Softmax is 1/2 -Lipschitz: A tight bound across all _p norms , author=. 2025 , eprint=

  22. [22]

    2017 , eprint=

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , author=. 2017 , eprint=

  23. [23]

    International workshop on multiple classifier systems , pages=

    Ensemble methods in machine learning , author=. International workshop on multiple classifier systems , pages=. 2000 , organization=

  24. [24]

    2023 , eprint=

    Git Re-Basin: Merging Models modulo Permutation Symmetries , author=. 2023 , eprint=

  25. [25]

    2023 , eprint=

    Mitigating Transformer Overconfidence via Lipschitz Regularization , author=. 2023 , eprint=

  26. [26]

    2023 , eprint=

    Local Lipschitz Bounds of Deep Neural Networks , author=. 2023 , eprint=

  27. [27]

    2025 , eprint=

    Fantastic Pretraining Optimizers and Where to Find Them , author=. 2025 , eprint=

  28. [28]

    2025 , eprint=

    Training Deep Learning Models with Norm-Constrained LMOs , author=. 2025 , eprint=

  29. [29]

    2020 , eprint=

    Scaling Laws for Neural Language Models , author=. 2020 , eprint=

  30. [30]

    2025 , eprint=

    AdaMuon: Adaptive Muon Optimizer , author=. 2025 , eprint=

  31. [31]

    2024 , url =

    Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =

  32. [32]

    2026 , eprint=

    PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective , author=. 2026 , eprint=

  33. [33]

    2025 , eprint=

    Kimi K2: Open Agentic Intelligence , author=. 2025 , eprint=

  34. [34]

    2025 , url =

    Jeremy Bernstein , title =. 2025 , url =

  35. [35]

    2017 , eprint=

    Topology and Geometry of Half-Rectified Network Optimization , author=. 2017 , eprint=

  36. [36]

    2013 , eprint=

    Horizontal and Vertical Ensemble with Deep Representation for Classification , author=. 2013 , eprint=

  37. [37]

    2025 , eprint=

    Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? , author=. 2025 , eprint=

  38. [38]

    2025 , eprint=

    ASGO: Adaptive Structured Gradient Optimization , author=. 2025 , eprint=

  39. [39]

    2025 , eprint=

    The Universal Weight Subspace Hypothesis , author=. 2025 , eprint=

  40. [40]

    2023 , eprint=

    StarCoder: may the source be with you! , author=. 2023 , eprint=

  41. [41]

    arXiv preprint , url =

    Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P , title =. arXiv preprint , url =

  42. [42]

    2024 , eprint=

    Efficient Streaming Language Models with Attention Sinks , author=. 2024 , eprint=

  43. [43]

    2023 , eprint=

    Llemma: An Open Language Model For Mathematics , author=. 2023 , eprint=

  44. [44]

    2025 , url =

    Jianlin Su , title =. 2025 , url =

  45. [45]

    2024 , eprint=

    DataComp-LM: In search of the next generation of training sets for language models , author=. 2024 , eprint=

  46. [46]

    2021 , eprint=

    Uniform convergence may be unable to explain generalization in deep learning , author=. 2021 , eprint=

  47. [47]

    2016 , eprint=

    The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT , author=. 2016 , eprint=

  48. [48]

    2022 , eprint=

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , author=. 2022 , eprint=

  49. [49]

    The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  50. [50]

    2024 , eprint=

    Flex Attention: A Programming Model for Generating Optimized Attention Kernels , author=. 2024 , eprint=

  51. [51]

    2024 , eprint=

    The Platonic Representation Hypothesis , author=. 2024 , eprint=

  52. [52]

    Advances in neural information processing systems , volume=

    Preconditioned spectral descent for deep learning , author=. Advances in neural information processing systems , volume=

  53. [53]

    Neural computation , volume=

    Flat minima , author=. Neural computation , volume=. 1997 , publisher=

  54. [54]

    2025 , eprint=

    Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts , author=. 2025 , eprint=

  55. [55]

    2026 , eprint=

    Steerable Adversarial Scenario Generation through Test-Time Preference Alignment , author=. 2026 , eprint=

  56. [56]

    2022 , eprint=

    Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

  57. [57]

    2015 , eprint=

    Very Deep Convolutional Networks for Large-Scale Image Recognition , author=. 2015 , eprint=

  58. [58]

    2025 , eprint=

    How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data , author=. 2025 , eprint=

  59. [59]

    2025 , eprint=

    Muon Outperforms Adam in Tail-End Associative Memory Learning , author=. 2025 , eprint=

  60. [60]

    2025 , eprint=

    Muon Optimizer Accelerates Grokking , author=. 2025 , eprint=

  61. [61]

    2019 , eprint=

    A Simple Method for Commonsense Reasoning , author=. 2019 , eprint=

  62. [62]

    2024 , eprint=

    Weight Scope Alignment: A Frustratingly Easy Method for Model Merging , author=. 2024 , eprint=

  63. [63]

    Proceedings of the 40th International Conference on Machine Learning , pages =

    Optimal Shrinkage for Distributed Second-Order Optimization , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

  64. [64]

    2016 , eprint=

    Pointer Sentinel Mixture Models , author=. 2016 , eprint=

  65. [65]

    2006 , howpublished =

    Hutter, Marcus , title =. 2006 , howpublished =

  66. [66]

    2015 , eprint=

    Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books , author=. 2015 , eprint=

  67. [67]

    2014 , eprint=

    One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling , author=. 2014 , eprint=

  68. [68]

    2026 , eprint=

    When do spectral gradient updates help in deep learning? , author=. 2026 , eprint=

  69. [69]

    2026 , eprint=

    Preconditioning Benefits of Spectral Orthogonalization in Muon , author=. 2026 , eprint=

  70. [70]

    2026 , eprint=

    Controlled LLM Training on Spectral Sphere , author=. 2026 , eprint=

  71. [71]

    2025 , eprint=

    Understanding Mode Connectivity via Parameter Space Symmetry , author=. 2025 , eprint=

  72. [72]

    2025 , month =

    Fantastic Pretraining Optimizers and Where to Find Them 2.1: Hyperball Optimization , author =. 2025 , month =

  73. [73]

    2025 , eprint=

    Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data , author=. 2025 , eprint=

  74. [74]

    2024 , eprint=

    Modular Duality in Deep Learning , author=. 2024 , eprint=

  75. [75]

    2024 , eprint=

    Old Optimizer, New Norm: An Anthology , author=. 2024 , eprint=

  76. [76]

    2025 , eprint=

    On Provable Benefits of Muon in Federated Learning , author=. 2025 , eprint=

  77. [77]

    2025 , eprint=

    On the Convergence of Muon and Beyond , author=. 2025 , eprint=

  78. [78]

    2025 , eprint=

    Convergence Bound and Critical Batch Size of Muon Optimizer , author=. 2025 , eprint=

  79. [79]

    2025 , eprint=

    Lions and Muons: Optimization via Stochastic Frank-Wolfe , author=. 2025 , eprint=

  80. [80]

    2025 , eprint=

    On the Convergence Analysis of Muon , author=. 2025 , eprint=

Showing first 80 references.