Flag Varieties: A Geometric Framework for Deep Network Alignment
Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3
The pith
Deep network alignment geometry is canonically a flag variety whose unique reparameterization-invariant observable is subspace intersection dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using geometric invariant theory, alignment geometry has a canonical closed, polystable stratum given by a flag variety, and subspace intersection dimension is its unique reparameterization-invariant observable. Ridge regularization therefore drives subspace alignment at an exponential rate determined by the weight-decay coefficient, while nonlinear activations produce a commutator obstruction to exact basis alignment that is absent in linear networks. The same invariant structure accounts for the Level-2/3 hierarchy in neural collapse from first principles and yields weight-space diagnostics, such as commutator magnitude and head-subspace overlap, that require no forward passes.
What carries the argument
The flag variety realized as the canonical closed polystable stratum of the alignment geometry under geometric invariant theory, which isolates subspace intersection dimension as the sole invariant.
If this is right
- Ridge regularization produces exponential subspace alignment whose rate is set by the weight-decay parameter.
- Nonlinear activations create a commutator obstruction to exact basis alignment that is generically absent in linear networks.
- The Level-2/3 hierarchy in neural collapse receives a geometric explanation directly from the invariant structure rather than post-hoc fitting.
- Commutator magnitude and head-subspace overlap function as internal diagnostics for alignment that need no forward passes through the network.
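The last two diagnostics invite a concrete weight-only rendering. The sketch below is a minimal numpy version, assuming the commutator of orthogonal projectors onto adjacent layers' column/row spaces as a stand-in for the paper's commutator magnitude, and summed squared principal-angle cosines as the overlap; the function names and the choice of subspaces are ours, not the paper's.

```python
# Weight-space diagnostics computed from weights alone, no forward passes.
# Minimal sketch: the projector commutator and principal-angle overlap below are
# assumed stand-ins for the paper's "commutator magnitude" and "subspace overlap".
import numpy as np

def orth_projector(A, r=None, tol=1e-10):
    """Orthogonal projector onto the column space of A (top-r directions if r is given)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if r is None:
        r = int((s > tol * s[0]).sum())
    Ur = U[:, :r]
    return Ur @ Ur.T

def commutator_magnitude(W1, W2, r=None):
    """||[P1, P2]||_F with P1 = proj onto col(W1), P2 = proj onto row(W2).
    Zero exactly when the two subspaces admit a common adapted basis."""
    P1 = orth_projector(W1, r)
    P2 = orth_projector(W2.T, r)     # row space of W2 = column space of W2^T
    return float(np.linalg.norm(P1 @ P2 - P2 @ P1))

def subspace_overlap(W1, W2, r=None):
    """trace(P1 P2) = sum of squared cosines of the principal angles; equals the
    smaller subspace dimension exactly when one subspace contains the other."""
    P1, P2 = orth_projector(W1, r), orth_projector(W2.T, r)
    return float(np.trace(P1 @ P2))

# Toy usage on a random adjacent pair of width-64 layers, restricted to rank-8 subspaces.
d = 64
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
print(commutator_magnitude(W1, W2, r=8), subspace_overlap(W1, W2, r=8))
```

On untrained random layers both quantities sit at their generic baselines; the paper's claim is that ridge-regularized training drives the commutator toward zero in linear networks while nonlinear ones retain a residual value.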
Where Pith is reading between the lines
- Linear networks should reach exact basis alignment under sufficient regularization, whereas nonlinear networks retain a residual commutator that scales with activation strength.
- The same invariant diagnostics could be monitored during training to detect when alignment has stabilized without requiring separate evaluation data.
- The approach of extracting unique invariants from polystable strata could be applied to other network structures such as attention layers or residual blocks to derive analogous alignment laws.
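The attention-layer extrapolation in the last bullet already has an obvious weight-only form: pairwise overlap between the column spaces of per-head weight slices, judged against a random-subspace baseline. The sketch below is illustrative only; `head_mats`, the per-head slicing, and the Gaussian baseline are our assumptions, not the paper's protocol.

```python
# Head-subspace overlap from weights alone (sketch; not the paper's exact estimator).
# `head_mats` is assumed to hold one (d_model x d_head) matrix per attention head,
# e.g. per-head slices of an output-projection weight; how to slice is model-specific.
import numpy as np

def overlap(A, B):
    """Sum of squared principal-angle cosines between col(A) and col(B)."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return float(np.sum(np.linalg.svd(Qa.T @ Qb, compute_uv=False) ** 2))

def mean_pairwise_overlap(head_mats):
    n = len(head_mats)
    return float(np.mean([overlap(head_mats[i], head_mats[j])
                          for i in range(n) for j in range(i + 1, n)]))

d_model, d_head, n_heads = 256, 32, 8
rng = np.random.default_rng(0)
# Random baseline: for independent random d_head-dimensional subspaces of R^d_model the
# expected pairwise overlap is d_head**2 / d_model.
random_heads = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]
print("random baseline:", mean_pairwise_overlap(random_heads),
      "expected ≈", d_head**2 / d_model)
# For a trained model, replace `random_heads` with actual per-head weight slices and
# compare; excess overlap relative to the baseline is the diagnostic signal.
```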
Load-bearing premise
Alignment phenomena in deep networks are assumed to be governed by the polystable strata that geometric invariant theory identifies in the space of weight matrices.
What would settle it
An experiment that reparameterizes a trained network while keeping its input-output map fixed and measures whether subspace intersection dimension changes, or that tracks alignment speed under varying weight decay and checks whether the observed rate matches the exponential prediction.
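The first experiment is cheap to prototype. A minimal sketch, assuming dim(col(W1) ∩ ker(W2)) as the intersection observable for an adjacent pair sharing a d-dimensional hidden space (the paper's exact observable may differ): plant a known intersection, apply a random invertible reparameterization that leaves the composed map W2 W1 unchanged, and check that the dimension is preserved.

```python
# Reparameterization-invariance check (sketch under stated assumptions: the observable
# tested is dim(col(W1) ∩ ker(W2)); the paper's exact intersection observable may differ).
import numpy as np

rng = np.random.default_rng(0)
d, d_in, d_out = 10, 6, 7

# Plant subspaces with a known 2-dimensional intersection inside the shared hidden space R^d.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W1 = Q[:, :4] @ rng.standard_normal((4, d_in))                  # col(W1) = span(Q0..Q3), dim 4
W2 = rng.standard_normal((d_out, 5)) @ Q[:, [0, 1, 7, 8, 9]].T  # ker(W2) = span(Q2..Q6), dim 5

def null_basis(W, tol=1e-10):
    """Orthonormal basis of ker(W) via SVD."""
    _, s, Vt = np.linalg.svd(W)
    rank = int((s > tol * s[0]).sum())
    return Vt[rank:].T

def intersection_dim(A, B):
    """dim(col(A) ∩ col(B)) = rank(A) + rank(B) - rank([A B])."""
    r = np.linalg.matrix_rank
    return r(A) + r(B) - r(np.hstack([A, B]))

def observable(W1, W2):
    return intersection_dim(W1, null_basis(W2))

h = rng.standard_normal((d, d))                  # generic invertible reparameterization
W1_re, W2_re = h @ W1, W2 @ np.linalg.inv(h)     # composed map W2 @ W1 is unchanged
print("before:", observable(W1, W2))             # expected: 2
print("after :", observable(W1_re, W2_re))       # should match if the observable is invariant
```

Both subspaces transform covariantly under h (col(h W1) = h·col(W1) and ker(W2 h^{-1}) = h·ker(W2)), which is what makes their intersection dimension a candidate invariant; the real experiment would run the same check on the paper's observable for a trained network, where rank tolerances and conditioning become a practical caveat.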
Original abstract
Alignment, the tendency of adjacent weight matrices in deep networks to develop compatible subspace orientations, underlies gradient flow, Neural Collapse, and representation similarity across architectures. Despite extensive empirical documentation, these phenomena have resisted unified theoretical treatment: existing explanations are post-hoc, each fitted to a specific observation with whatever mathematics is at hand. We reverse this direction by deriving the mathematical structure that layerwise alignment inherently demands. Using geometric invariant theory, we prove that alignment geometry has a canonical closed, polystable stratum given by a flag variety, and that subspace intersection dimension is its unique reparameterization-invariant observable, establishing that subspace metrics are not empirical conventions but mathematical necessities. This unified framework yields two dynamical consequences: ridge regularization drives subspace alignment at an exponential rate set by weight decay, whereas nonlinear activations induce a commutator obstruction to exact basis alignment, generically present in nonlinear networks and absent in linear ones. Together these give a geometric explanation of the Level-2/3 hierarchy in Neural Collapse from first principles rather than post-hoc analysis. The commutator magnitude and head subspace overlap further serve as weight-space windows into internal alignment structure, requiring no forward passes. Experiments on multilayer perceptrons, residual networks, and pretrained language models support the proposed diagnostics and delineate their scope.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that alignment phenomena in deep networks are governed by a geometric structure derived from geometric invariant theory (GIT): the space of aligned weight matrices admits a canonical closed polystable stratum that is a flag variety, with the dimension of subspace intersections serving as the unique reparameterization-invariant observable. From this, the authors derive two dynamical consequences—exponential alignment under ridge regularization and a commutator obstruction induced by nonlinear activations—and use these to explain the Level-2/3 hierarchy in Neural Collapse. The framework is supported by experiments on MLPs, ResNets, and pretrained language models that introduce diagnostics based on commutator magnitude and head-subspace overlap.
Significance. If the central GIT identification holds, the work supplies a first-principles derivation of alignment geometry rather than post-hoc fitting, potentially unifying observations across gradient flow, Neural Collapse, and representation similarity. The explicit link between ridge regularization and exponential rates, together with the commutator obstruction distinguishing linear from nonlinear networks, offers falsifiable predictions and new weight-space diagnostics that require no forward passes. The experimental validation across multiple architectures strengthens the claim that subspace intersection dimension is not an arbitrary metric but a mathematical necessity.
major comments (3)
- [§4] §4 (GIT Setup and Polystable Stratum): The manuscript asserts that the natural action of GL(n) × GL(m) on pairs of weight matrices yields a GIT quotient whose closed polystable locus is canonically a flag variety with intersection dimension as the sole invariant. However, the precise linearization and stability condition are not shown to be forced by reparameterization invariance of the network loss alone; standard GIT quotients for matrix pairs produce determinantal or Segre varieties rather than flag varieties. This step is load-bearing for the claim that subspace metrics are mathematical necessities rather than modeling choices.
- [§5.2] §5.2 (Dynamical Consequences): The derivation that ridge regularization produces exponential alignment rates relies on the polystable stratum being the relevant attractor under gradient flow. If the GIT identification in §4 is an additional modeling assumption rather than derived, the exponential-rate claim reduces to a consequence of the chosen stability condition and does not follow from reparameterization invariance.
- [§6] §6 (Commutator Obstruction): The claim that nonlinear activations generically induce a commutator obstruction to exact basis alignment, absent in linear networks, is presented as a direct consequence of the flag-variety geometry. The manuscript does not provide an explicit algebraic computation showing that the invariant ring is generated solely by Plücker coordinates of intersections, which would be required to establish uniqueness of the observable.
minor comments (3)
- [Abstract, §1] The abstract and introduction repeatedly use the phrase 'mathematical necessities' before the GIT derivation is complete; this phrasing should be qualified until after the load-bearing identification is established.
- [Figure 3] Figure 3 (commutator magnitude plots) lacks error bars or multiple random seeds; the visual separation between linear and nonlinear cases would be more convincing with statistical quantification.
- [§4] Notation for the flag variety and its Plücker embedding is introduced without a self-contained definition or reference to standard texts (e.g., Fulton & Harris); a brief appendix recalling the relevant GIT facts would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify areas where the manuscript would benefit from additional explicit derivations. We have revised the paper to supply these details while preserving the original claims, which we continue to regard as following directly from the GIT analysis.
Point-by-point responses
Referee: [§4] §4 (GIT Setup and Polystable Stratum): The manuscript asserts that the natural action of GL(n) × GL(m) on pairs of weight matrices yields a GIT quotient whose closed polystable locus is canonically a flag variety with intersection dimension as the sole invariant. However, the precise linearization and stability condition are not shown to be forced by reparameterization invariance of the network loss alone; standard GIT quotients for matrix pairs produce determinantal or Segre varieties rather than flag varieties. This step is load-bearing for the claim that subspace metrics are mathematical necessities rather than modeling choices.
Authors: We agree that the original text did not spell out the linearization step with sufficient detail. In the revision we add a dedicated subsection that starts from the requirement that the network loss L(W1, W2) be invariant under simultaneous basis change (W1, W2) ↦ (g W1 h^{-1}, h W2 k^{-1}). This invariance forces a specific linearization of the GL(n) × GL(m) action on the space of matrix pairs; the associated moment map and Hilbert-Mumford numerical function then select precisely the flag variety stratified by intersection dimension as the closed polystable locus. The determinantal and Segre varieties arise only under different linearizations (e.g., the standard Plücker embedding for rank conditions). The revised argument therefore derives the flag-variety structure from reparameterization invariance rather than imposing it as an extra modeling choice. revision: yes
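For orientation, the structure the rebuttal describes can be restated compactly; this is our schematic rendering in assumed notation, not the revision's verbatim statement.

```latex
% Schematic restatement of the rebuttal's setup (our notation; not the paper's verbatim statement).
The reparameterization acts on an adjacent pair of weight matrices by
\[
  (W_1, W_2) \;\longmapsto\; \bigl(g\,W_1\,h^{-1},\; h\,W_2\,k^{-1}\bigr),
  \qquad g,\, h,\, k \ \text{invertible},
\]
with $h$ the basis change of the shared hidden space $\mathbb{R}^d$. The pair determines two
subspaces $(V, U) \in \mathrm{Gr}(r_1, d) \times \mathrm{Gr}(r_2, d)$ of that hidden space; the
diagonal $\mathrm{GL}(d)$-orbits of such pairs are classified by the single integer
$\dim(V \cap U)$, and the incidence locus
\[
  F_{r_1, r_2} \;=\; \{\, (V, U) : V \subseteq U \,\} \;\cong\; \mathrm{Fl}(r_1, r_2;\, d)
\]
is the closed stratum the authors identify as polystable.
```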
Referee: [§5.2] §5.2 (Dynamical Consequences): The derivation that ridge regularization produces exponential alignment rates relies on the polystable stratum being the relevant attractor under gradient flow. If the GIT identification in §4 is an additional modeling assumption rather than derived, the exponential-rate claim reduces to a consequence of the chosen stability condition and does not follow from reparameterization invariance.
Authors: The revised §5.2 now contains a short lemma showing that the same linearization derived in the new §4 makes the polystable stratum the unique closed attractor of the regularized gradient flow. Because the ridge term is the squared norm of the moment map for this linearization, the flow decreases the loss while driving the pair (W1, W2) toward the polystable locus at a rate controlled by the decay coefficient. The exponential convergence is therefore a direct consequence of the invariance-derived GIT structure, not an independent assumption. revision: yes
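The rate claim is directly checkable in the smallest nontrivial case. A minimal sketch, assuming a two-layer linear model fit to a low-rank target by gradient descent with weight decay, and using the principal-angle gap between the layers' top singular subspaces as the alignment measure (our stand-in, not the paper's exact observable):

```python
# Numerical check of the exponential-alignment-rate claim (illustrative sketch only).
import numpy as np

LR, STEPS, D, R = 1e-2, 20_000, 20, 3

def principal_angle_gap(W1, W2, r):
    U1 = np.linalg.svd(W1)[0][:, :r]          # subspace layer 1 writes into
    V2 = np.linalg.svd(W2)[2][:r, :].T        # subspace layer 2 reads from
    cos = np.linalg.svd(U1.T @ V2, compute_uv=False)
    return 1.0 - cos.min()                    # 0 when the two subspaces coincide

def train(decay, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((D, R)) @ rng.standard_normal((R, D)) / np.sqrt(D)  # rank-R target
    W1 = 0.1 * rng.standard_normal((D, D))
    W2 = 0.1 * rng.standard_normal((D, D))
    gaps = []
    for _ in range(STEPS):
        E = W2 @ W1 - M                        # residual of the end-to-end linear map
        g1 = W2.T @ E + decay * W1             # gradient of 0.5||W2W1-M||^2 + 0.5*decay*||.||^2
        g2 = E @ W1.T + decay * W2
        W1, W2 = W1 - LR * g1, W2 - LR * g2
        gaps.append(principal_angle_gap(W1, W2, R))
    return np.array(gaps)

for decay in (0.03, 0.1, 0.3):
    gaps = train(decay)
    tail = np.log(gaps[STEPS // 2:] + 1e-12)   # fit the late-phase slope of log(gap)
    t = np.arange(tail.size) * LR
    rate = -np.polyfit(t, tail, 1)[0]
    print(f"decay={decay:.2f}  final gap={gaps[-1]:.2e}  fitted rate ≈ {rate:.3f}")
```

If the exponential-rate prediction holds, the fitted rates should scale roughly linearly with the decay coefficient; systematic deviations would delimit the scope of the claim.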
Referee: [§6] §6 (Commutator Obstruction): The claim that nonlinear activations generically induce a commutator obstruction to exact basis alignment, absent in linear networks, is presented as a direct consequence of the flag-variety geometry. The manuscript does not provide an explicit algebraic computation showing that the invariant ring is generated solely by Plücker coordinates of intersections, which would be required to establish uniqueness of the observable.
Authors: We have inserted an explicit algebraic proposition in the revised §6. Using the fact that the GIT quotient is geometric and that the flag variety is embedded via its Plücker coordinates on the intersection subspaces, we prove that every invariant polynomial is a polynomial in these coordinates. The proof proceeds by showing that any higher-degree invariant can be reduced via the relations coming from the flag manifold’s cohomology ring. This establishes both the uniqueness of the intersection-dimension observable and the generic non-vanishing of the commutator for nonlinear activations (which move the pair off the linear subspace where the commutator vanishes). revision: yes
Circularity Check
No significant circularity: the derivation applies external GIT machinery to the network's reparameterization symmetry, and no prediction reduces by construction to a fitted input.
full rationale
The paper's central claim derives the flag variety as the canonical polystable stratum for alignment geometry via geometric invariant theory, treating subspace intersection dimension as the unique reparameterization-invariant observable. This structure is presented as following from the group action on weight matrices rather than being presupposed by the alignment definition or fitted to data. Dynamical consequences (exponential rates under ridge regularization, commutator obstruction) are derived as outputs of the framework, not inputs. No self-citations are load-bearing for the uniqueness or canonicity statements, and no equations reduce a prediction to a fitted parameter by construction. The derivation remains self-contained against the external GIT benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: geometric invariant theory applies directly to the space of adjacent weight matrices in deep networks to determine stable alignment strata.
- Ad hoc to this paper: subspace intersection dimension is the unique reparameterization-invariant observable under the alignment geometry.