pith. machine review for the scientific record.

arxiv: 2604.08764 · v1 · submitted 2026-04-09 · 💻 cs.CL · math.DG

Recognition: unknown

Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3

classification 💻 cs.CL math.DG
keywords anisotropy · transformers · learning dynamics · tangent space · gradient anisotropy · representation geometry · mechanistic interpretability

The pith

Low-rank tangent proxies from activations capture most gradient anisotropy in language transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that anisotropy in transformer representations is not a mysterious post-training feature but a direct result of learning dynamics that amplify directions tangent to the representation manifold. It first derives geometric reasons why frequency-biased sampling reduces visible curvature and why gradient steps preferentially strengthen tangent components. The authors then apply concept-based interpretability mid-training to construct low-rank tangent proxies from activations and compare them directly to true backpropagated gradients. These proxies capture both unusually large gradient energy and a markedly higher share of the total anisotropy than matched-rank random controls, and the pattern holds across encoder and decoder models. A sympathetic reader would care because the result reframes anisotropy as a predictable geometric consequence of optimization rather than an unexplained distortion.
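As an illustration only, not the authors' pipeline, a minimal sketch of this kind of comparison could look as follows; plain PCA stands in for the concept-based extraction, and the activation and gradient arrays, shapes, and constants are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer: activations and per-example gradients,
# each of shape (n, d). The anisotropic scaling makes some directions dominate,
# and the gradients are constructed to be partly aligned with the activations.
d, n, k = 256, 4096, 16
acts = rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)
grads = 0.7 * acts + 0.3 * rng.normal(size=(n, d))

def top_k_basis(X, k):
    """Orthonormal basis (d x k) of the top-k principal directions of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T

def captured_energy(G, B):
    """Fraction of total gradient energy lying in the span of the columns of B."""
    return np.sum((G @ B) ** 2) / np.sum(G ** 2)

proxy = top_k_basis(acts, k)                         # activation-derived low-rank proxy
control = np.linalg.qr(rng.normal(size=(d, k)))[0]   # matched-rank random control

print("proxy capture  :", captured_energy(grads, proxy))
print("control capture:", captured_energy(grads, control))
print("chance level   :", k / d)
```

The quantity of interest is how far the proxy's captured fraction sits above both the random control and the chance level k/d.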

Core claim

When activation-derived low-rank tangent proxies are fit during training rather than only after the fact, they capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy across encoder-style and decoder-style language models.

What carries the argument

Activation-derived low-rank tangent proxies that approximate the tangent space of the representation manifold and are tested against true gradients.

If this is right

  • Training dynamics preferentially amplify tangent directions, explaining why anisotropy grows during optimization.
  • Frequency-biased sampling attenuates curvature visibility, making the manifold appear flatter than it is.
  • The tangent-aligned mechanism operates similarly in both encoder-style and decoder-style language models.
  • Low-rank proxies built from activations can serve as efficient diagnostics for gradient behavior without full backpropagation.
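On the last point above, a sketch of what such a diagnostic could look like in practice, assuming activations are harvested with forward hooks during training; scikit-learn's IncrementalPCA is used here as one convenient streaming estimator, not the paper's method, and all shapes are illustrative.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(1)
d, k, batch = 256, 16, 512   # illustrative hidden size, proxy rank, batch size

ipca = IncrementalPCA(n_components=k)

for step in range(20):
    # In practice these would be hidden states captured with a forward hook at
    # each monitoring step; here they are synthetic and mildly anisotropic.
    acts = rng.normal(size=(batch, d)) * np.linspace(2.0, 0.5, d)
    ipca.partial_fit(acts)

# Orthonormal rows spanning the current activation-derived diagnostic subspace,
# available at any point in training without any backward pass.
proxy_basis = ipca.components_   # shape (k, d)
print(proxy_basis.shape)
```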

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularizing curvature during training might reduce unwanted anisotropy while preserving task performance.
  • The geometric account could be tested in non-transformer architectures to check if tangent amplification is general.
  • If the proxies predict downstream behavior, they could guide early stopping or architecture adjustments.

Load-bearing premise

That the low-rank proxies constructed from activations during training accurately approximate the true tangent space of the representation manifold, and that this alignment is not an artifact of the specific concept-based interpretability method chosen.

What would settle it

Observing that activation-derived low-rank proxies capture no more gradient anisotropy than random vectors of equal rank in a held-out model, or that true gradients align equally with any low-rank subspace.

Figures

Figures reproduced from arXiv: 2604.08764 by Antonin Poché, Céline Hudelot, Fanny Jourdan, Raphael Bernas.

Figure 1: Temporal dynamics of embedding variance across model architectures. Left: Training trajectories for the top-5 most frequent versus bottom-5 least frequent tokens in PYTHIA-410m, illustrating the tighter confinement of frequent tokens. Middle: Mean Euclidean distance to centroid vs. log-frequency for PYTHIA (160m, 410m, 1b, 1.4b), SMOLLM2 (360m, 1.7b), and EUROBERT (210m, 610m). Causal decoders exhibit a st… view at source ↗
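A hedged sketch of the middle-panel statistic as the caption describes it, on synthetic embeddings and Zipf-like token counts rather than real checkpoints:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic token embeddings and Zipf-like corpus counts (the figure uses real checkpoints).
vocab, d = 10_000, 128
W = rng.normal(size=(vocab, d))
counts = rng.zipf(1.3, size=vocab).astype(float)

dist_to_centroid = np.linalg.norm(W - W.mean(axis=0), axis=1)
log_freq = np.log10(counts)

# Mean distance to centroid per log-frequency bin, as in the middle panel.
bins = np.linspace(log_freq.min(), log_freq.max() + 1e-9, 11)
which = np.digitize(log_freq, bins)
for b in range(1, len(bins)):
    mask = which == b
    if mask.any():
        print(f"log10(freq) bin {b:2d}: mean distance {dist_to_centroid[mask].mean():.2f}")
```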
Figure 2: Curvature bias and its attenuation. (a) Under uniform radial sampling, directions with higher curvature κ reach larger intrinsic distances r at the same ambient radius t, causing over-representation of curved regions. (b) When the sampling density g(t) concentrates at small t, all directions yield similar intrinsic displacements and the curvature bias is attenuated. Intrinsic vs. extrinsic distance. Consid… view at source ↗
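One concrete way to read the caption, under the toy assumption that a given direction bends like a circle of constant curvature κ (radius 1/√κ), which is not the paper's general derivation: the ambient (chord) distance t and the intrinsic (arc) distance r are then related by

```latex
t = \frac{2}{\sqrt{\kappa}}\,\sin\!\left(\frac{\sqrt{\kappa}\,r}{2}\right)
\qquad\Longleftrightarrow\qquad
r = \frac{2}{\sqrt{\kappa}}\,\arcsin\!\left(\frac{\sqrt{\kappa}\,t}{2}\right)
```

At fixed t, the ratio r/t = arcsin(u)/u with u = √κ t/2 grows with κ, so more curved directions contribute larger intrinsic displacements (panel a); when g(t) concentrates at small t, u → 0 and r ≈ t for every κ, which attenuates the bias (panel b).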
Figure 3: Embedding-update tangent and normal enrichments across token frequency. For each model, we compute tangent and normal enrichment ratios on the embedding update dW = W_{t+n} − W_t, normalized so that the random baseline is 1. The left panel in each model reports tangent enrichment; the right panel reports normal enrichment. Curves are plotted against log10(freq) and colored by token rarity group, which makes th… view at source ↗
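The caption's exact normalization is not spelled out here, so the following sketch uses one natural reading: the energy fraction of dW captured by a rank-k basis, divided by the chance level k/d so that a random basis scores about 1. The bases and the update matrix are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 128, 8

def orthonormal(M):
    return np.linalg.qr(M)[0]

tangent = orthonormal(rng.normal(size=(d, k)))   # stand-in for an estimated tangent basis
normal = orthonormal(rng.normal(size=(d, k)))    # stand-in for a matched-rank normal basis

# Hypothetical embedding update dW = W_{t+n} - W_t for 500 tokens, given extra
# energy along the tangent stand-in so the enrichment is visible.
dW = rng.normal(size=(500, d)) + 3.0 * rng.normal(size=(500, k)) @ tangent.T

def enrichment(dW, B):
    """Energy fraction captured by basis B, divided by the chance level k/d
    so that a random basis scores about 1."""
    frac = np.sum((dW @ B) ** 2) / np.sum(dW ** 2)
    return frac / (B.shape[1] / dW.shape[1])

print("tangent enrichment:", enrichment(dW, tangent))
print("normal  enrichment:", enrichment(dW, normal))
```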
Figure 4: Forest Cover Type Anisotropic Components. large energy relative to matched normal alternatives, and the direct T > N sign test, which asks whether tangent removal improves isotropy more consistently than matched-normal removal across anchors. These comparisons do not replace IsoScore*; they provide the geometric attribution that a global scalar anisotropy statistic cannot provide on its own. In that sense,… view at source ↗
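A hedged sketch of what a tangent-versus-normal removal comparison with a sign test could look like; IsoScore* and the paper's anchors are not reproduced, so a simple eigenvalue-entropy isotropy proxy, synthetic gradients, and a plain binomial sign test stand in.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(4)
d, k, n_anchors = 64, 4, 20

def isotropy(X):
    """Crude isotropy proxy: normalized entropy of the covariance eigenvalues (1 = isotropic)."""
    evals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 1e-12, None)
    p = evals / evals.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def remove_subspace(X, B):
    """Project out the span of the orthonormal columns of B."""
    return X - (X @ B) @ B.T

wins = 0
for _ in range(n_anchors):
    tangent = np.linalg.qr(rng.normal(size=(d, k)))[0]   # stand-in tangent anchor
    matched = np.linalg.qr(rng.normal(size=(d, k)))[0]   # stand-in matched-normal anchor
    # Synthetic gradients with extra energy along the tangent stand-in.
    G = rng.normal(size=(2000, d)) + 2.0 * rng.normal(size=(2000, k)) @ tangent.T
    gain_t = isotropy(remove_subspace(G, tangent)) - isotropy(G)
    gain_n = isotropy(remove_subspace(G, matched)) - isotropy(G)
    wins += gain_t > gain_n

# One-sided sign test: does tangent removal beat matched removal more often than chance?
print(binomtest(wins, n_anchors, 0.5, alternative="greater"))
```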
Figure 5: Evolution of gradient-data concept alignment for the embedding layer during training. We report three complementary alignment metrics as a function of the gradient monitoring step: (left) mean correlation over all extracted concepts, (middle) mean correlation restricted to the top-10 highest-aligned concepts, and (right) mean correlation for high-quality alignments (above the spectral gap threshold). Data … view at source ↗
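A minimal sketch of the three alignment summaries named in the caption, computed on hypothetical concept vectors and a single gradient direction; a fixed cutoff stands in for the paper's spectral-gap threshold.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_concepts = 128, 50

concepts = rng.normal(size=(n_concepts, d))   # hypothetical concept directions
grad_dir = rng.normal(size=d)                 # hypothetical embedding-layer gradient direction

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Absolute cosine alignment between each concept and the gradient direction.
align = np.abs(unit(concepts) @ unit(grad_dir))

threshold = 0.2   # fixed stand-in for the spectral-gap threshold
print("mean over all concepts:", align.mean())
print("mean over top-10      :", np.sort(align)[-10:].mean())
high = align[align > threshold]
print("mean above threshold  :", high.mean() if high.size else float("nan"))
```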
Figure 6: Correlation density maps for the three main components of the model: the first transformer linear head, the second transformer linear head, and the embedding matrix (first layer), shown as a function of training time (x-axis). To construct these maps, we first pass data-derived concepts through the model to obtain an estimate of the concepts learned by the network. Independently, we treat the weight matric… view at source ↗
read the original abstract

Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplifies tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper derives geometric arguments showing that frequency-biased sampling in language data attenuates curvature visibility on the representation manifold and that gradient descent preferentially amplifies tangent directions. It then constructs activation-derived low-rank tangent proxies via concept-based mechanistic interpretability applied during training (rather than post hoc) and reports that these proxies capture substantially more gradient energy and a larger fraction of gradient anisotropy than matched-rank normal controls, across both encoder- and decoder-style models.

Significance. If the low-rank proxies are shown to faithfully span the tangent space, the combination of geometric derivation and controlled empirical comparison would provide a coherent, manifold-based account of anisotropy that moves beyond purely post-hoc geometric observations and could guide future work on representation geometry and training dynamics.

major comments (1)
  1. The central empirical claim (that the activation-derived proxies capture unusually large gradient energy and anisotropy share) is load-bearing on the assumption that these proxies accurately approximate the true tangent space. The manuscript does not appear to include independent verification of this (e.g., curvature estimates, orthogonality checks against estimated normal vectors, or ablation of the concept-fitting procedure), leaving open the possibility that the observed alignment is partly an artifact of the interpretability method's selection of high-variance or gradient-correlated directions.
minor comments (2)
  1. The description of the proxy-fitting procedure and the exact definition of 'matched-rank normal controls' would benefit from additional detail on hyperparameter choices and statistical controls to allow full reproducibility.
  2. Figure captions and axis labels should explicitly state whether error bars represent standard deviation across seeds, layers, or models.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for highlighting a key assumption in our empirical section. We respond to the major comment below and will incorporate additional verification steps in the revised manuscript.

read point-by-point responses
  1. Referee: The central empirical claim (that the activation-derived proxies capture unusually large gradient energy and anisotropy share) is load-bearing on the assumption that these proxies accurately approximate the true tangent space. The manuscript does not appear to include independent verification of this (e.g., curvature estimates, orthogonality checks against estimated normal vectors, or ablation of the concept-fitting procedure), leaving open the possibility that the observed alignment is partly an artifact of the interpretability method's selection of high-variance or gradient-correlated directions.

    Authors: We appreciate the referee identifying this assumption. The proxies are obtained via concept-based mechanistic interpretability applied during training, which extracts low-rank directions of activation variation that the geometric analysis in Section 3 predicts should align with the tangent space (because frequency-biased sampling reduces observable curvature). The primary evidence we report is their superior capture of gradient energy and anisotropy share relative to matched-rank controls. We acknowledge, however, that this comparison alone does not fully rule out selection artifacts. In the revision we will add (i) an ablation replacing the concept-fitting step with random and top-variance directions of the same rank, (ii) orthogonality checks of the proxies against estimated normal directions on a subset of layers, and (iii) curvature estimates along the proxy directions for the smaller models. These additions will directly test whether the observed alignment exceeds what would be expected from high-variance or gradient-correlated directions alone. revision: yes
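For concreteness, item (ii) of the promised additions could be implemented with principal angles between subspaces, as sketched below; the proxy and estimated-normal bases here are synthetic placeholders rather than quantities from the paper.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(6)
d, k = 256, 16

# Synthetic placeholders for the activation-derived proxy basis and an independently
# estimated normal basis (each with k orthonormal columns in d dimensions).
proxy = np.linalg.qr(rng.normal(size=(d, k)))[0]
normal_est = np.linalg.qr(rng.normal(size=(d, k)))[0]

# Principal angles near 90 degrees would indicate the proxy is close to orthogonal
# to the estimated normal directions, as a genuine tangent approximation should be.
angles_deg = np.degrees(subspace_angles(proxy, normal_est))
print("min / median principal angle (deg):", angles_deg.min(), np.median(angles_deg))
```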

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via independent geometric arguments and controlled empirical tests

full rationale

The paper derives geometric arguments from manifold properties regarding frequency-biased sampling and tangent amplification during training, then conducts empirical validation by fitting low-rank proxies from activations via concept-based interpretability and comparing them to backpropagated gradients against matched-rank controls. No quoted steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claim rests on comparative statistics that are not tautological with the proxy construction. The analysis is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that representation spaces admit a well-defined tangent geometry and that activation statistics during training can proxy that geometry without additional fitted constants beyond the low-rank choice.

free parameters (1)
  • rank of tangent proxies
    Low-rank approximation dimension is chosen to fit activation data; exact value not stated in abstract but affects the comparison.
axioms (1)
  • domain assumption: The representation manifold has a tangent space that can be approximated from activations
    Invoked when constructing low-rank tangent proxies and linking them to gradient directions.
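Since the rank is the ledger's lone free parameter, a small sensitivity sweep is one way to check that the proxy-versus-random gap is not an artifact of a single rank choice; the arrays below are the same kind of synthetic stand-ins used in the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 256, 4096

# Synthetic activations and partly aligned gradients, as in the first sketch.
acts = rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)
grads = 0.7 * acts + 0.3 * rng.normal(size=(n, d))

def captured(G, B):
    return np.sum((G @ B) ** 2) / np.sum(G ** 2)

_, _, Vt = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
for k in (4, 8, 16, 32, 64):
    proxy = Vt[:k].T
    control = np.linalg.qr(rng.normal(size=(d, k)))[0]
    print(f"k={k:3d}  proxy={captured(grads, proxy):.3f}  "
          f"control={captured(grads, control):.3f}  chance={k / d:.3f}")
```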

pith-pipeline@v0.9.0 · 5453 in / 1231 out tokens · 65562 ms · 2026-05-10T17:02:26.407170+00:00 · methodology

discussion (0)

