pith. sign in

arxiv: 2606.13276 · v1 · pith:LAVMIUQXnew · submitted 2026-06-11 · 💻 cs.LG · cs.AI

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Pith reviewed 2026-06-27 07:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords manifold constraintstransformer optimizationStiefel geometryDGram geometryweight-space geometryattention layersMLP layersGPT-2 pretraining
0
0 comments X

The pith

Attention layers favor Stiefel geometry while MLP layers favor DGram geometry for stable transformer optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether attention and MLP modules in transformers benefit from different geometric constraints on their weights. Using the Manifold Muon optimizer on GPT-2 pretraining, it assigns Stiefel or DGram manifolds layer-wise and measures training behavior. The best results come from Stiefel on attention and DGram on MLP; the reverse and all-DGram cases become unstable because DGram on attention produces growing singular values that saturate the softmax. A sympathetic reader would care because uniform manifold constraints may be limiting progress in geometry-aware optimization of large models.

Core claim

Constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. This failure traces to singular value growth in DGram-constrained attention weights, which amplifies attention logits and induces softmax saturation.

What carries the argument

Module-wise assignment of Stiefel and DGram manifold constraints inside the Manifold Muon optimizer.

If this is right

  • Uniform manifold constraints across attention and MLP blocks can produce instability.
  • Stiefel geometry on attention layers prevents singular-value blowup and keeps training stable.
  • DGram geometry on MLP layers supports effective optimization when attention uses Stiefel.
  • Symmetry-aware optimization for transformers should be chosen per module type rather than applied uniformly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar module-specific geometry preferences may appear in other transformer variants or larger models.
  • An optimizer that learns or selects the manifold per layer could outperform fixed assignments.
  • The same asymmetry might affect fine-tuning or other downstream tasks beyond pretraining.

Load-bearing premise

The shared hyperparameter setting is fair for comparison and any instability comes specifically from singular-value growth in DGram-constrained attention weights.

What would settle it

Training the all-DGram configuration with per-module hyperparameter sweeps and checking whether singular values still grow and cause softmax saturation.

Figures

Figures reproduced from arXiv: 2606.13276 by Kirato Yoshihara.

Figure 1
Figure 1. Figure 1: Validation loss curves for the five layer-wise manifold assignments. Left: full training trajectory, including unstable runs. Right: zoomed view of the stable configurations in the late-training regime. HETERO achieves the lowest stable validation loss, while both configurations that assign DGram to attention, ALL-DGRAM and HETERO-INV, become unstable. This suggests that, under this optimizer setting, atte… view at source ↗
Figure 2
Figure 2. Figure 2: Spectral evolution during training. Each curve reports the maximum singular value over all matrices in the corresponding parameter group across layers. Left: attention weights. DGram attention exhibits explosive singular value growth in ALL-DGRAM and HETERO-INV under the shared hyperparameter setting. Right: MLP/FFN weights. DGram is compatible with MLP/FFN weights and grows without becoming unstable when … view at source ↗
Figure 3
Figure 3. Figure 3: Schematic of the HETERO manifold assignment in GPT-2. Within each of the 12 Transformer blocks, attention projections are updated by Stiefel Manifold Muon (orange) and MLP/FFN projections by DGram Manifold Muon (blue). All remaining parameters—input embedding, positional encoding, final linear layer, and normalization scales—are optimized with AdamW (gray). DGram imposes a weaker Gram constraint. In our im… view at source ↗
read the original abstract

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that different transformer modules prefer different manifold geometries during optimization. Using Manifold Muon for GPT-2 pretraining, it compares layer-wise assignments of Stiefel and DGram constraints to attention and MLP blocks. The results indicate an asymmetry: Stiefel on attention layers combined with DGram on MLP layers yields the best performance, while the inverted assignment and all-DGram configurations become unstable under a shared hyperparameter setting. The instability is traced to singular-value growth in DGram-constrained attention weights, which amplifies attention logits and induces softmax saturation. The work concludes that symmetry-aware and geometry-aware optimization should be module-specific rather than uniform.

Significance. If the reported performance ordering and mechanistic attribution hold under controlled comparisons, the result would establish that uniform manifold constraints are suboptimal for transformers and that module-specific geometry choices can improve stability and final performance. The explicit link between singular-value growth and softmax saturation supplies a concrete, testable mechanism that could inform the design of future constrained optimizers.

major comments (2)
  1. [Abstract] Abstract and results: the central claim of a clear performance ordering and instability under the shared hyperparameter setting is stated without quantitative metrics, error bars, statistical tests, or details on how singular values were measured, so the ordering cannot be verified from the given text.
  2. [Results / Experimental Setup] Experimental comparison: the claim that inverted and all-DGram configurations are intrinsically unstable due to singular-value growth in attention weights rests on a single shared hyperparameter regime; no ablation is reported on learning-rate sensitivity or initialization scale for the DGram-attention case, leaving open the possibility that modest hyperparameter adjustment could stabilize those configurations without harming the best one.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful feedback on our work regarding module-wise manifold constraints in transformer optimization. We provide point-by-point responses to the major comments and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results: the central claim of a clear performance ordering and instability under the shared hyperparameter setting is stated without quantitative metrics, error bars, statistical tests, or details on how singular values were measured, so the ordering cannot be verified from the given text.

    Authors: The abstract is intentionally concise and does not include detailed metrics. The full paper presents the performance ordering with quantitative results, including error bars from multiple random seeds and statistical comparisons in the results section. Details on singular value measurement are provided in the experimental methodology. We will revise the abstract to incorporate key quantitative metrics supporting the claims and ensure the singular value analysis is clearly described. revision: yes

  2. Referee: [Results / Experimental Setup] Experimental comparison: the claim that inverted and all-DGram configurations are intrinsically unstable due to singular-value growth in attention weights rests on a single shared hyperparameter regime; no ablation is reported on learning-rate sensitivity or initialization scale for the DGram-attention case, leaving open the possibility that modest hyperparameter adjustment could stabilize those configurations without harming the best one.

    Authors: We would like to clarify that the manuscript states the configurations 'become unstable under the shared hyperparameter setting,' not that they are intrinsically unstable. The use of a shared hyperparameter regime is deliberate to compare the configurations on equal footing. To further address the concern about sensitivity, we will include additional experiments ablating the learning rate and initialization scale for the DGram-constrained attention layers in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparison with no derivations or self-referential predictions

full rationale

The paper reports experimental results comparing Stiefel and DGram manifold constraints assigned to attention vs. MLP layers in GPT-2 pretraining. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. Claims rest on observed performance and instability under a shared hyperparameter regime, which is an empirical observation rather than a reduction to inputs by construction. The skeptic concern about hyperparameter fairness is a methodological question, not evidence of circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an empirical comparison without introducing mathematical axioms, free parameters, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5667 in / 1016 out tokens · 17279 ms · 2026-06-27T07:26:40.085499+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    2024 , url =

    Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =

  2. [2]

    Thinking Machines Lab: Connectionism , year =

    Jeremy Bernstein , title =. Thinking Machines Lab: Connectionism , year =

  3. [3]

    2025 , howpublished =

    Keigwin, Ben and Pai, Dhruv and Chen, Nathan , title =. 2025 , howpublished =

  4. [4]

    2025 , url =

    Franz Louis Cesista , title =. 2025 , url =

  5. [5]

    Muon is Scalable for LLM Training

    Muon is scalable for llm training , author=. arXiv preprint arXiv:2502.16982 , year=

  6. [6]

    International Conference on Learning Representations , year =

    Ilya Loshchilov and Frank Hutter , title =. International Conference on Learning Representations , year =

  7. [7]

    Advances in Neural Information Processing Systems , year=

    Attention Is All You Need , author=. Advances in Neural Information Processing Systems , year=

  8. [8]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  9. [9]

    Proceedings of the AAAI Conference on Artificial Intelligence , year=

    Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

  10. [10]

    Proceedings of the AAAI Conference on Artificial Intelligence , year=

    Feedback gradient descent: Efficient and stable optimization with orthogonality for DNNs , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

  11. [11]

    arXiv preprint arXiv:2601.21487 , year=

    Manifold constrained steepest descent , author=. arXiv preprint arXiv:2601.21487 , year=

  12. [12]

    International Conference on Learning Representations , year=

    The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks , author=. International Conference on Learning Representations , year=

  13. [13]

    International Conference on Learning Representations , year=

    Git Re-Basin: Merging Models modulo Permutation Symmetries , author=. International Conference on Learning Representations , year=

  14. [14]

    2022 , howpublished =

    Karpathy, Andrej , title =. 2022 , howpublished =

  15. [15]

    OpenWebText Corpus , author=

  16. [16]

    arXiv preprint arXiv:2601.23000 , year=

    Mano: Restriking Manifold Optimization for LLM Training , author=. arXiv preprint arXiv:2601.23000 , year=

  17. [17]

    Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

    Muon optimizes under spectral norm constraints , author=. arXiv preprint arXiv:2506.15054 , year=