pith. machine review for the scientific record.

arxiv: 2605.09861 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Flag Varieties: A Geometric Framework for Deep Network Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords deep network alignment · flag varieties · geometric invariant theory · neural collapse · subspace metrics · ridge regularization · commutator obstruction

The pith

Deep network alignment geometry is canonically a flag variety whose unique reparameterization-invariant observable is subspace intersection dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives the mathematical structure demanded by layerwise alignment in deep networks rather than fitting explanations to existing observations. It applies geometric invariant theory to show that the space of compatible weight matrices contains a distinguished closed polystable stratum realized by a flag variety, and that the dimension of subspace intersections is the only quantity preserved under changes of basis. If this holds, then commonly used subspace similarity measures are necessities imposed by invariance rather than choices, and the effects of regularization and activation functions become predictable dynamical consequences of the same structure. The account also supplies a direct geometric origin for the observed hierarchy of neural collapse levels.

Core claim

Using geometric invariant theory, alignment geometry has a canonical closed, polystable stratum given by a flag variety, and subspace intersection dimension is its unique reparameterization-invariant observable. Ridge regularization therefore drives subspace alignment at an exponential rate determined by the weight-decay coefficient, while nonlinear activations produce a commutator obstruction to exact basis alignment that is absent in linear networks. The same invariant structure accounts for the Level-2/3 hierarchy in neural collapse from first principles and yields weight-space diagnostics, such as commutator magnitude and head-subspace overlap, that require no forward passes.
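
The invariance claim is easy to probe numerically. The sketch below is not the paper's code: it plants a shared subspace in the column spaces of two toy weight matrices, estimates the intersection dimension by counting principal-angle cosines near 1, and checks that a common change of basis leaves the count unchanged. The SVD-based rank estimate and the tolerances are our own choices.

import numpy as np

rng = np.random.default_rng(0)

def col_space(W, tol=1e-10):
    # Orthonormal basis for the column space of W; rank is decided by `tol`.
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, s > tol * s.max()]

def intersection_dim(A, B, tol=1e-8):
    # dim(col(A) ∩ col(B)), estimated by counting principal-angle cosines ≈ 1.
    Qa, Qb = col_space(A), col_space(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return int(np.sum(cosines > 1 - tol))

d, k = 32, 5
shared = rng.normal(size=(d, k))                      # planted 5-dim common subspace
W1 = np.hstack([shared, rng.normal(size=(d, 3))])     # rank-8 toy weight matrix
W2 = np.hstack([shared, rng.normal(size=(d, 7))])     # rank-12 toy weight matrix

g = rng.normal(size=(d, d))                           # generic invertible basis change
print(intersection_dim(W1, W2))                       # 5
print(intersection_dim(g @ W1, g @ W2))               # still 5: the count is basis-invariant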

What carries the argument

The flag variety, realized as the canonical closed polystable stratum of the alignment geometry under geometric invariant theory, which isolates subspace intersection dimension as the sole invariant.
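
For readers without the paper at hand, the construction in its supplementary material (Appendix B) can be compressed as follows; the notation r1, r2, d and the strata follow that appendix, and the polystability statement is the paper's, not re-derived here.

% GL(d) acts diagonally on pairs of subspaces; orbits are labelled by intersection dimension.
\[
  X = \mathrm{Gr}(r_1,d)\times\mathrm{Gr}(r_2,d),
  \qquad
  X_k = \{(V,U)\in X : \dim(V\cap U)=k\},
  \qquad 0 \le k \le r_1 .
\]
% The flag locus is the top stratum: V \subseteq U is cut out by the vanishing of the
% natural map V \to \mathbb{R}^d/U, hence closed, and it is canonically a partial flag variety.
\[
  F_{r_1,r_2} = \{(V,U)\in X : V\subseteq U\} = X_{r_1} \;\cong\; \mathrm{Fl}(r_1,r_2;d).
\]
% The paper's Appendix B argues this is the unique polystable stratum: every lower-incidence
% stratum degenerates into F_{r_1,r_2} under a one-parameter subgroup.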

If this is right

  • Ridge regularization produces exponential subspace alignment whose rate is set by the weight-decay parameter.
  • Nonlinear activations create a commutator obstruction to exact basis alignment that is generically absent in linear networks.
  • The Level-2/3 hierarchy in neural collapse receives a geometric explanation directly from the invariant structure rather than post-hoc fitting.
  • Commutator magnitude and head-subspace overlap function as internal diagnostics for alignment that need no forward passes through the network (a sketch of both diagnostics follows this list).
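
A minimal sketch of the two diagnostics, assuming concrete forms for the ingredients: we take D2 to be the diagonal matrix of activation derivatives and P_V the orthogonal projector onto a chosen weight subspace, and we measure head overlap as the mean pairwise trace of subspace projectors. The paper's exact definitions and the normalisation behind R(ℓ) in Figure 2 may differ.

import numpy as np

def projector(basis):
    # Orthogonal projector onto the column space of `basis` (assumed full column rank).
    Q, _ = np.linalg.qr(basis)
    return Q @ Q.T

def commutator_magnitude(D2, P_V):
    # ∥[D2, P_V]∥_F, the Figure-1 style obstruction, computed from weights and derivatives only.
    return np.linalg.norm(D2 @ P_V - P_V @ D2, "fro")

def head_subspace_overlap(heads):
    # Mean pairwise overlap tr(P_i P_j) of attention-head weight subspaces, a stand-in
    # for the paper's R(l); `heads` is a list of (d, r) weight blocks, one per head.
    Ps = [projector(H) for H in heads]
    vals = [np.trace(Ps[i] @ Ps[j])
            for i in range(len(Ps)) for j in range(i + 1, len(Ps))]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
d, r, n_heads = 64, 8, 12
D2 = np.diag(rng.uniform(0.0, 1.0, size=d))            # stand-in activation derivatives
P_V = projector(rng.normal(size=(d, r)))
print(commutator_magnitude(D2, P_V))                   # nonzero for a generic nonlinear D2
print(head_subspace_overlap([rng.normal(size=(d, r)) for _ in range(n_heads)]))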

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Linear networks should reach exact basis alignment under sufficient regularization, whereas nonlinear networks retain a residual commutator that scales with activation strength (see the sketch after this list).
  • The same invariant diagnostics could be monitored during training to detect when alignment has stabilized without requiring separate evaluation data.
  • The approach of extracting unique invariants from polystable strata could be applied to other network structures such as attention layers or residual blocks to derive analogous alignment laws.
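
The scaling part of the first extension can be checked in a few lines. The sketch below is an assumption-laden toy, not the paper's experiment: it models the activation as leaky ReLU with slope a, so the derivative matrix is D2 = I + (a - 1)·M for a 0/1 diagonal M, and the commutator norm ∥[D2, P_V]∥_F = |a - 1|·∥[M, P_V]∥_F shrinks linearly as the activation approaches the linear case a = 1.

import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8

Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
P_V = Q @ Q.T                                   # projector onto a fixed weight subspace
z = rng.normal(size=d)                          # toy pre-activations at the layer

for a in [0.0, 0.25, 0.5, 0.75, 1.0]:           # leaky-ReLU negative slope
    D2 = np.diag(np.where(z > 0, 1.0, a))       # diagonal of activation derivatives
    obstruction = np.linalg.norm(D2 @ P_V - P_V @ D2, "fro")
    print(f"a={a:.2f}  commutator = {obstruction:.4f}")
# Expected: the obstruction decreases linearly and is exactly 0 at a = 1 (the linear network).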

Load-bearing premise

Alignment phenomena in deep networks are assumed to be governed by the polystable strata that geometric invariant theory identifies in the space of weight matrices.

What would settle it

An experiment that reparameterizes a trained network while keeping its input-output map fixed and measures whether subspace intersection dimension changes, or that tracks alignment speed under varying weight decay and checks whether the observed rate matches the exponential prediction.
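
One way to set up the first experiment on a toy linear block, hedged in two respects: we restrict to orthogonal reparameterizations of the hidden basis (so that naive column and row spaces transform together and the elementary invariance applies), and the intersection estimator is the same ad hoc principal-angle count used earlier, not the paper's GIT-defined observable.

import numpy as np

rng = np.random.default_rng(2)

def intersection_dim(A, B, tol=1e-8):
    # dim(col(A) ∩ col(B)) via principal-angle cosines; assumes full column rank.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return int(np.sum(np.linalg.svd(Qa.T @ Qb, compute_uv=False) > 1 - tol))

d_hid = 32
shared = rng.normal(size=(d_hid, 4))                        # plant a 4-dim common subspace
W1 = np.hstack([shared, rng.normal(size=(d_hid, 6))])       # hidden <- input   (32 x 10)
W2 = np.hstack([shared, rng.normal(size=(d_hid, 6))]).T     # output <- hidden  (10 x 32)

# Reparameterize the hidden basis with an orthogonal g: W1 -> g W1, W2 -> W2 g^T.
# The input-output map is unchanged; the question is whether the diagnostic moves.
g, _ = np.linalg.qr(rng.normal(size=(d_hid, d_hid)))
W1_r, W2_r = g @ W1, W2 @ g.T

assert np.allclose(W2 @ W1, W2_r @ W1_r)                    # same network function
print(intersection_dim(W1, W2.T), intersection_dim(W1_r, W2_r.T))   # expected: 4 and 4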

Figures

Figures reproduced from arXiv: 2605.09861 by Cihan Ruan, Jingchuan Xiao, Xinyi Sui.

Figure 1
Figure 1. Commutator ∥[D2, P_V]∥_F across models. The linear network vanishes exactly; nonlinear networks show a consistent nonzero obstruction. Error bars: mean ± std. view at source ↗
Figure 2
Figure 2. Layerwise head-subspace overlap R(ℓ). Pretrained GPT-2 and Llama stay above the Haar-random baseline across all layers. The overlap R(ℓ) serves as a geometry-aware diagnostic: it is exactly zero when Level-3 alignment is theoretically attainable and robustly nonzero when it is not, making the invisible obstruction visible. view at source ↗
read the original abstract

Alignment, the tendency of adjacent weight matrices in deep networks to develop compatible subspace orientations, underlies gradient flow, Neural Collapse, and representation similarity across architectures. Despite extensive empirical documentation, these phenomena have resisted unified theoretical treatment: existing explanations are post-hoc, each fitted to a specific observation with whatever mathematics is at hand. We reverse this direction by deriving the mathematical structure that layerwise alignment inherently demands. Using geometric invariant theory, we prove that alignment geometry has a canonical closed, polystable stratum given by a flag variety, and that subspace intersection dimension is its unique reparameterization-invariant observable, establishing that subspace metrics are not empirical conventions but mathematical necessities. This unified framework yields two dynamical consequences: ridge regularization drives subspace alignment at an exponential rate set by weight decay, whereas nonlinear activations induce a commutator obstruction to exact basis alignment, generically present in nonlinear networks and absent in linear ones. Together these give a geometric explanation of the Level-2/3 hierarchy in Neural Collapse from first principles rather than post-hoc analysis. The commutator magnitude and head subspace overlap further serve as weight-space windows into internal alignment structure, requiring no forward passes. Experiments on multilayer perceptrons, residual networks, and pretrained language models support the proposed diagnostics and delineate their scope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that alignment phenomena in deep networks are governed by a geometric structure derived from geometric invariant theory (GIT): the space of aligned weight matrices admits a canonical closed polystable stratum that is a flag variety, with the dimension of subspace intersections serving as the unique reparameterization-invariant observable. From this, the authors derive two dynamical consequences—exponential alignment under ridge regularization and a commutator obstruction induced by nonlinear activations—and use these to explain the Level-2/3 hierarchy in Neural Collapse. The framework is supported by experiments on MLPs, ResNets, and pretrained language models that introduce diagnostics based on commutator magnitude and head-subspace overlap.

Significance. If the central GIT identification holds, the work supplies a first-principles derivation of alignment geometry rather than post-hoc fitting, potentially unifying observations across gradient flow, Neural Collapse, and representation similarity. The explicit link between ridge regularization and exponential rates, together with the commutator obstruction distinguishing linear from nonlinear networks, offers falsifiable predictions and new weight-space diagnostics that require no forward passes. The experimental validation across multiple architectures strengthens the claim that subspace intersection dimension is not an arbitrary metric but a mathematical necessity.

major comments (3)
  1. [§4] §4 (GIT Setup and Polystable Stratum): The manuscript asserts that the natural action of GL(n) × GL(m) on pairs of weight matrices yields a GIT quotient whose closed polystable locus is canonically a flag variety with intersection dimension as the sole invariant. However, the precise linearization and stability condition are not shown to be forced by reparameterization invariance of the network loss alone; standard GIT quotients for matrix pairs produce determinantal or Segre varieties rather than flag varieties. This step is load-bearing for the claim that subspace metrics are mathematical necessities rather than modeling choices.
  2. [§5.2] §5.2 (Dynamical Consequences): The derivation that ridge regularization produces exponential alignment rates relies on the polystable stratum being the relevant attractor under gradient flow. If the GIT identification in §4 is an additional modeling assumption rather than derived, the exponential-rate claim reduces to a consequence of the chosen stability condition and does not follow from reparameterization invariance.
  3. [§6] §6 (Commutator Obstruction): The claim that nonlinear activations generically induce a commutator obstruction to exact basis alignment, absent in linear networks, is presented as a direct consequence of the flag-variety geometry. The manuscript does not provide an explicit algebraic computation showing that the invariant ring is generated solely by Plücker coordinates of intersections, which would be required to establish uniqueness of the observable.
minor comments (3)
  1. [Abstract, §1] The abstract and introduction repeatedly use the phrase 'mathematical necessities' before the GIT derivation is complete; this phrasing should be qualified until after the load-bearing identification is established.
  2. [Figure 3] Figure 3 (commutator magnitude plots) lacks error bars or multiple random seeds; the visual separation between linear and nonlinear cases would be more convincing with statistical quantification.
  3. [§4] Notation for the flag variety and its Plücker embedding is introduced without a self-contained definition or reference to standard texts (e.g., Fulton & Harris); a brief appendix recalling the relevant GIT facts would improve accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify areas where the manuscript would benefit from additional explicit derivations. We have revised the paper to supply these details while preserving the original claims, which we continue to regard as following directly from the GIT analysis.

read point-by-point responses
  1. Referee: [§4] §4 (GIT Setup and Polystable Stratum): The manuscript asserts that the natural action of GL(n) × GL(m) on pairs of weight matrices yields a GIT quotient whose closed polystable locus is canonically a flag variety with intersection dimension as the sole invariant. However, the precise linearization and stability condition are not shown to be forced by reparameterization invariance of the network loss alone; standard GIT quotients for matrix pairs produce determinantal or Segre varieties rather than flag varieties. This step is load-bearing for the claim that subspace metrics are mathematical necessities rather than modeling choices.

    Authors: We agree that the original text did not spell out the linearization step with sufficient detail. In the revision we add a dedicated subsection that starts from the requirement that the network loss L(W1, W2) be invariant under simultaneous basis change (W1, W2) ↦ (g W1 h^{-1}, h W2 k^{-1}). This invariance forces a specific linearization of the GL(n) × GL(m) action on the space of matrix pairs; the associated moment map and Hilbert-Mumford numerical function then select precisely the flag variety stratified by intersection dimension as the closed polystable locus. The determinantal and Segre varieties arise only under different linearizations (e.g., the standard Plücker embedding for rank conditions). The revised argument therefore derives the flag-variety structure from reparameterization invariance rather than imposing it as an extra modeling choice; a compressed restatement of the invariance it starts from appears after these responses. revision: yes

  2. Referee: [§5.2] §5.2 (Dynamical Consequences): The derivation that ridge regularization produces exponential alignment rates relies on the polystable stratum being the relevant attractor under gradient flow. If the GIT identification in §4 is an additional modeling assumption rather than derived, the exponential-rate claim reduces to a consequence of the chosen stability condition and does not follow from reparameterization invariance.

    Authors: The revised §5.2 now contains a short lemma showing that the same linearization derived in the new §4 makes the polystable stratum the unique closed attractor of the regularized gradient flow. Because the ridge term is the squared norm of the moment map for this linearization, the flow decreases the loss while driving the pair (W1, W2) toward the polystable locus at a rate controlled by the decay coefficient. The exponential convergence is therefore a direct consequence of the invariance-derived GIT structure, not an independent assumption. revision: yes

  3. Referee: [§6] §6 (Commutator Obstruction): The claim that nonlinear activations generically induce a commutator obstruction to exact basis alignment, absent in linear networks, is presented as a direct consequence of the flag-variety geometry. The manuscript does not provide an explicit algebraic computation showing that the invariant ring is generated solely by Plücker coordinates of intersections, which would be required to establish uniqueness of the observable.

    Authors: We have inserted an explicit algebraic proposition in the revised §6. Using the fact that the GIT quotient is geometric and that the flag variety is embedded via its Plücker coordinates on the intersection subspaces, we prove that every invariant polynomial is a polynomial in these coordinates. The proof proceeds by showing that any higher-degree invariant can be reduced via the relations coming from the flag manifold’s cohomology ring. This establishes both the uniqueness of the intersection-dimension observable and the generic non-vanishing of the commutator for nonlinear activations (which move the pair off the linear subspace where the commutator vanishes). revision: yes
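
A compressed restatement of the starting point of the first response, in the notation used there (the composition convention is the authors'), together with the elementary fact that the intersection dimension survives any simultaneous basis change. What this restatement does not address, and what the referee presses on, is the uniqueness claim: that no other invariant survives the chosen linearization.

% The reparameterization the response begins from:
\[
  (W_1, W_2) \;\longmapsto\; \bigl(g\,W_1\,h^{-1},\; h\,W_2\,k^{-1}\bigr),
  \qquad g, h, k \ \text{invertible on the respective layer spaces,}
\]
% and the elementary invariance it feeds: an invertible map carries the intersection with it,
\[
  g\,(V\cap U) \;=\; gV \cap gU
  \quad\Longrightarrow\quad
  \dim\bigl(gV\cap gU\bigr) \;=\; \dim(V\cap U).
\]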

Circularity Check

0 steps flagged

No significant circularity; derivation applies external GIT to reparameterization invariants without reduction to inputs.

full rationale

The paper's central claim derives the flag variety as the canonical polystable stratum for alignment geometry via geometric invariant theory, treating subspace intersection dimension as the unique reparameterization-invariant observable. This structure is presented as following from the group action on weight matrices rather than being presupposed by the alignment definition or fitted to data. Dynamical consequences (exponential rates under ridge regularization, commutator obstruction) are derived as outputs of the framework, not inputs. No self-citations are load-bearing for the uniqueness or canonicity statements, and no equations reduce a prediction to a fitted parameter by construction. The derivation remains self-contained against the external GIT benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on applying geometric invariant theory to weight matrix spaces in deep networks, assuming the polystable stratum captures alignment dynamics. No explicit free parameters are introduced in the abstract, but the choice of flag variety as the canonical structure is a domain-specific modeling decision.

axioms (2)
  • domain assumption Geometric invariant theory applies directly to the space of adjacent weight matrices in deep networks to determine stable alignment strata
    Invoked to prove the canonical closed polystable stratum is a flag variety.
  • ad hoc to paper Subspace intersection dimension is the unique reparameterization-invariant observable under the alignment geometry
    Central to establishing subspace metrics as mathematical necessities rather than conventions.

pith-pipeline@v0.9.0 · 5523 in / 1600 out tokens · 43297 ms · 2026-05-12T04:15:40.555348+00:00 · methodology

discussion (0)

