pith. machine review for the scientific record.

arxiv: 2604.08764 · v1 · submitted 2026-04-09 · 💻 cs.CL · math.DG

Recognition: unknown

Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3

classification 💻 cs.CL math.DG
keywords anisotropy · transformers · learning dynamics · tangent space · gradient anisotropy · representation geometry · mechanistic interpretability

The pith

Low-rank tangent proxies from activations capture most gradient anisotropy in language transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that anisotropy in transformer representations is not a mysterious post-training feature but a direct result of learning dynamics that amplify directions tangent to the representation manifold. It first derives geometric reasons why frequency-biased sampling reduces visible curvature and why gradient steps preferentially strengthen tangent components. The authors then apply concept-based interpretability mid-training to construct low-rank tangent proxies from activations and compare them directly to true backpropagated gradients. These proxies capture both unusually large gradient energy and a markedly higher share of the total anisotropy than matched-rank random controls, and the pattern holds across encoder and decoder models. A sympathetic reader would care because the result reframes anisotropy as a predictable geometric consequence of optimization rather than an unexplained distortion.
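As an illustration only, not the authors' pipeline, a minimal sketch of this kind of comparison could look as follows; plain PCA stands in for the concept-based extraction, and the activation and gradient arrays, shapes, and constants are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer: activations and per-example gradients,
# each of shape (n, d). The anisotropic scaling makes some directions dominate,
# and the gradients are constructed to be partly aligned with the activations.
d, n, k = 256, 4096, 16
acts = rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)
grads = 0.7 * acts + 0.3 * rng.normal(size=(n, d))

def top_k_basis(X, k):
    """Orthonormal basis (d x k) of the top-k principal directions of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T

def captured_energy(G, B):
    """Fraction of total gradient energy lying in the span of the columns of B."""
    return np.sum((G @ B) ** 2) / np.sum(G ** 2)

proxy = top_k_basis(acts, k)                         # activation-derived low-rank proxy
control = np.linalg.qr(rng.normal(size=(d, k)))[0]   # matched-rank random control

print("proxy capture  :", captured_energy(grads, proxy))
print("control capture:", captured_energy(grads, control))
print("chance level   :", k / d)
```

The quantity of interest is how far the proxy's captured fraction sits above both the random control and the chance level k/d.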

Core claim

When activation-derived low-rank tangent proxies are fit during training rather than only after the fact, they capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy across encoder-style and decoder-style language models.

What carries the argument

Activation-derived low-rank tangent proxies that approximate the tangent space of the representation manifold and are tested against true gradients.

If this is right

  • Training dynamics preferentially amplify tangent directions, explaining why anisotropy grows during optimization.
  • Frequency-biased sampling attenuates curvature visibility, making the manifold appear flatter than it is.
  • The tangent-aligned mechanism operates similarly in both encoder-style and decoder-style language models.
  • Low-rank proxies built from activations can serve as efficient diagnostics for gradient behavior without full backpropagation.
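On the last point above, a sketch of what such a diagnostic could look like in practice, assuming activations are harvested with forward hooks during training; scikit-learn's IncrementalPCA is used here as one convenient streaming estimator, not the paper's method, and all shapes are illustrative.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(1)
d, k, batch = 256, 16, 512   # illustrative hidden size, proxy rank, batch size

ipca = IncrementalPCA(n_components=k)

for step in range(20):
    # In practice these would be hidden states captured with a forward hook at
    # each monitoring step; here they are synthetic and mildly anisotropic.
    acts = rng.normal(size=(batch, d)) * np.linspace(2.0, 0.5, d)
    ipca.partial_fit(acts)

# Orthonormal rows spanning the current activation-derived diagnostic subspace,
# available at any point in training without any backward pass.
proxy_basis = ipca.components_   # shape (k, d)
print(proxy_basis.shape)
```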

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularizing curvature during training might reduce unwanted anisotropy while preserving task performance.
  • The geometric account could be tested in non-transformer architectures to check if tangent amplification is general.
  • If the proxies predict downstream behavior, they could guide early stopping or architecture adjustments.

Load-bearing premise

That the low-rank proxies constructed from activations during training accurately approximate the true tangent space of the representation manifold, and that this alignment is not an artifact of the specific concept-based interpretability method chosen.

What would settle it

Observing that activation-derived low-rank proxies capture no more gradient anisotropy than random vectors of equal rank in a held-out model, or that true gradients align equally with any low-rank subspace.

Figures

Figures reproduced from arXiv: 2604.08764 by Antonin Poché, Céline Hudelot, Fanny Jourdan, Raphael Bernas.

Figure 1: Temporal dynamics of embedding variance across model architectures. Left: Training trajectories for the top-5 most frequent versus bottom-5 least frequent tokens in PYTHIA-410m, illustrating the tighter confinement of frequent tokens. Middle: Mean Euclidean distance to centroid vs. log-frequency for PYTHIA (160m, 410m, 1b, 1.4b), SMOLLM2 (360m, 1.7b), and EUROBERT (210m, 610m). Causal decoders exhibit a st… view at source ↗
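A hedged sketch of the middle-panel statistic as the caption describes it, on synthetic embeddings and Zipf-like token counts rather than real checkpoints:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic token embeddings and Zipf-like corpus counts (the figure uses real checkpoints).
vocab, d = 10_000, 128
W = rng.normal(size=(vocab, d))
counts = rng.zipf(1.3, size=vocab).astype(float)

dist_to_centroid = np.linalg.norm(W - W.mean(axis=0), axis=1)
log_freq = np.log10(counts)

# Mean distance to centroid per log-frequency bin, as in the middle panel.
bins = np.linspace(log_freq.min(), log_freq.max() + 1e-9, 11)
which = np.digitize(log_freq, bins)
for b in range(1, len(bins)):
    mask = which == b
    if mask.any():
        print(f"log10(freq) bin {b:2d}: mean distance {dist_to_centroid[mask].mean():.2f}")
```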
Figure 2: Curvature bias and its attenuation. (a) Under uniform radial sampling, directions with higher curvature κ reach larger intrinsic distances r at the same ambient radius t, causing over-representation of curved regions. (b) When the sampling density g(t) concentrates at small t, all directions yield similar intrinsic displacements and the curvature bias is attenuated. Intrinsic vs. extrinsic distance. Consid… view at source ↗
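One concrete way to read the caption, under the toy assumption that a given direction bends like a circle of constant curvature κ (radius 1/√κ), which is not the paper's general derivation: the ambient (chord) distance t and the intrinsic (arc) distance r are then related by

```latex
t = \frac{2}{\sqrt{\kappa}}\,\sin\!\left(\frac{\sqrt{\kappa}\,r}{2}\right)
\qquad\Longleftrightarrow\qquad
r = \frac{2}{\sqrt{\kappa}}\,\arcsin\!\left(\frac{\sqrt{\kappa}\,t}{2}\right)
```

At fixed t, the ratio r/t = arcsin(u)/u with u = √κ t/2 grows with κ, so more curved directions contribute larger intrinsic displacements (panel a); when g(t) concentrates at small t, u → 0 and r ≈ t for every κ, which attenuates the bias (panel b).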
Figure 3: Embedding-update tangent and normal enrichments across token frequency. For each model, we compute tangent and normal enrichment ratios on the embedding update dW = W_{t+n} − W_t, normalized so that the random baseline is 1. The left panel in each model reports tangent enrichment; the right panel reports normal enrichment. Curves are plotted against log10(freq) and colored by token rarity group, which makes th… view at source ↗
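The caption's exact normalization is not spelled out here, so the following sketch uses one natural reading: the energy fraction of dW captured by a rank-k basis, divided by the chance level k/d so that a random basis scores about 1. The bases and the update matrix are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 128, 8

def orthonormal(M):
    return np.linalg.qr(M)[0]

tangent = orthonormal(rng.normal(size=(d, k)))   # stand-in for an estimated tangent basis
normal = orthonormal(rng.normal(size=(d, k)))    # stand-in for a matched-rank normal basis

# Hypothetical embedding update dW = W_{t+n} - W_t for 500 tokens, given extra
# energy along the tangent stand-in so the enrichment is visible.
dW = rng.normal(size=(500, d)) + 3.0 * rng.normal(size=(500, k)) @ tangent.T

def enrichment(dW, B):
    """Energy fraction captured by basis B, divided by the chance level k/d
    so that a random basis scores about 1."""
    frac = np.sum((dW @ B) ** 2) / np.sum(dW ** 2)
    return frac / (B.shape[1] / dW.shape[1])

print("tangent enrichment:", enrichment(dW, tangent))
print("normal  enrichment:", enrichment(dW, normal))
```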
Figure 4: Forest Cover Type Anisotropic Components. large energy relative to matched normal alternatives, and the direct T > N sign test, which asks whether tangent removal improves isotropy more consistently than matched-normal removal across anchors. These comparisons do not replace IsoScore*; they provide the geometric attribution that a global scalar anisotropy statistic cannot provide on its own. In that sense,… view at source ↗
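A hedged sketch of what a tangent-versus-normal removal comparison with a sign test could look like; IsoScore* and the paper's anchors are not reproduced, so a simple eigenvalue-entropy isotropy proxy, synthetic gradients, and a plain binomial sign test stand in.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(4)
d, k, n_anchors = 64, 4, 20

def isotropy(X):
    """Crude isotropy proxy: normalized entropy of the covariance eigenvalues (1 = isotropic)."""
    evals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 1e-12, None)
    p = evals / evals.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def remove_subspace(X, B):
    """Project out the span of the orthonormal columns of B."""
    return X - (X @ B) @ B.T

wins = 0
for _ in range(n_anchors):
    tangent = np.linalg.qr(rng.normal(size=(d, k)))[0]   # stand-in tangent anchor
    matched = np.linalg.qr(rng.normal(size=(d, k)))[0]   # stand-in matched-normal anchor
    # Synthetic gradients with extra energy along the tangent stand-in.
    G = rng.normal(size=(2000, d)) + 2.0 * rng.normal(size=(2000, k)) @ tangent.T
    gain_t = isotropy(remove_subspace(G, tangent)) - isotropy(G)
    gain_n = isotropy(remove_subspace(G, matched)) - isotropy(G)
    wins += gain_t > gain_n

# One-sided sign test: does tangent removal beat matched removal more often than chance?
print(binomtest(wins, n_anchors, 0.5, alternative="greater"))
```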
Figure 5: Evolution of gradient-data concept alignment for the embedding layer during training. We report three complementary alignment metrics as a function of the gradient monitoring step: (left) mean correlation over all extracted concepts, (middle) mean correlation restricted to the top-10 highest-aligned concepts, and (right) mean correlation for high-quality alignments (above the spectral gap threshold). Data … view at source ↗
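A minimal sketch of the three alignment summaries named in the caption, computed on hypothetical concept vectors and a single gradient direction; a fixed cutoff stands in for the paper's spectral-gap threshold.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_concepts = 128, 50

concepts = rng.normal(size=(n_concepts, d))   # hypothetical concept directions
grad_dir = rng.normal(size=d)                 # hypothetical embedding-layer gradient direction

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Absolute cosine alignment between each concept and the gradient direction.
align = np.abs(unit(concepts) @ unit(grad_dir))

threshold = 0.2   # fixed stand-in for the spectral-gap threshold
print("mean over all concepts:", align.mean())
print("mean over top-10      :", np.sort(align)[-10:].mean())
high = align[align > threshold]
print("mean above threshold  :", high.mean() if high.size else float("nan"))
```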
Figure 6: Correlation density maps for the three main components of the model: the first transformer linear head, the second transformer linear head, and the embedding matrix (first layer), shown as a function of training time (x-axis). To construct these maps, we first pass data-derived concepts through the model to obtain an estimate of the concepts learned by the network. Independently, we treat the weight matric… view at source ↗
read the original abstract

Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplifies tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper derives geometric arguments showing that frequency-biased sampling in language data attenuates curvature visibility on the representation manifold and that gradient descent preferentially amplifies tangent directions. It then constructs activation-derived low-rank tangent proxies via concept-based mechanistic interpretability applied during training (rather than post hoc) and reports that these proxies capture substantially more gradient energy and a larger fraction of gradient anisotropy than matched-rank normal controls, across both encoder- and decoder-style models.

Significance. If the low-rank proxies are shown to faithfully span the tangent space, the combination of geometric derivation and controlled empirical comparison would provide a coherent, manifold-based account of anisotropy that moves beyond purely post-hoc geometric observations and could guide future work on representation geometry and training dynamics.

major comments (1)
  1. The central empirical claim (that the activation-derived proxies capture unusually large gradient energy and anisotropy share) is load-bearing on the assumption that these proxies accurately approximate the true tangent space. The manuscript does not appear to include independent verification of this (e.g., curvature estimates, orthogonality checks against estimated normal vectors, or ablation of the concept-fitting procedure), leaving open the possibility that the observed alignment is partly an artifact of the interpretability method's selection of high-variance or gradient-correlated directions.
minor comments (2)
  1. The description of the proxy-fitting procedure and the exact definition of 'matched-rank normal controls' would benefit from additional detail on hyperparameter choices and statistical controls to allow full reproducibility.
  2. Figure captions and axis labels should explicitly state whether error bars represent standard deviation across seeds, layers, or models.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for highlighting a key assumption in our empirical section. We respond to the major comment below and will incorporate additional verification steps in the revised manuscript.

read point-by-point responses
  1. Referee: The central empirical claim (that the activation-derived proxies capture unusually large gradient energy and anisotropy share) is load-bearing on the assumption that these proxies accurately approximate the true tangent space. The manuscript does not appear to include independent verification of this (e.g., curvature estimates, orthogonality checks against estimated normal vectors, or ablation of the concept-fitting procedure), leaving open the possibility that the observed alignment is partly an artifact of the interpretability method's selection of high-variance or gradient-correlated directions.

    Authors: We appreciate the referee identifying this assumption. The proxies are obtained via concept-based mechanistic interpretability applied during training, which extracts low-rank directions of activation variation that the geometric analysis in Section 3 predicts should align with the tangent space (because frequency-biased sampling reduces observable curvature). The primary evidence we report is their superior capture of gradient energy and anisotropy share relative to matched-rank controls. We acknowledge, however, that this comparison alone does not fully rule out selection artifacts. In the revision we will add (i) an ablation replacing the concept-fitting step with random and top-variance directions of the same rank, (ii) orthogonality checks of the proxies against estimated normal directions on a subset of layers, and (iii) curvature estimates along the proxy directions for the smaller models. These additions will directly test whether the observed alignment exceeds what would be expected from high-variance or gradient-correlated directions alone. revision: yes
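For concreteness, item (ii) of the promised additions could be implemented with principal angles between subspaces, as sketched below; the proxy and estimated-normal bases here are synthetic placeholders rather than quantities from the paper.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(6)
d, k = 256, 16

# Synthetic placeholders for the activation-derived proxy basis and an independently
# estimated normal basis (each with k orthonormal columns in d dimensions).
proxy = np.linalg.qr(rng.normal(size=(d, k)))[0]
normal_est = np.linalg.qr(rng.normal(size=(d, k)))[0]

# Principal angles near 90 degrees would indicate the proxy is close to orthogonal
# to the estimated normal directions, as a genuine tangent approximation should be.
angles_deg = np.degrees(subspace_angles(proxy, normal_est))
print("min / median principal angle (deg):", angles_deg.min(), np.median(angles_deg))
```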

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via independent geometric arguments and controlled empirical tests

full rationale

The paper derives geometric arguments from manifold properties regarding frequency-biased sampling and tangent amplification during training, then conducts empirical validation by fitting low-rank proxies from activations via concept-based interpretability and comparing them to backpropagated gradients against matched-rank controls. No quoted steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claim rests on comparative statistics that are not tautological with the proxy construction. The analysis is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that representation spaces admit a well-defined tangent geometry and that activation statistics during training can proxy that geometry without additional fitted constants beyond the low-rank choice.

free parameters (1)
  • rank of tangent proxies
    Low-rank approximation dimension is chosen to fit activation data; exact value not stated in abstract but affects the comparison.
axioms (1)
  • domain assumption: The representation manifold has a tangent space that can be approximated from activations
    Invoked when constructing low-rank tangent proxies and linking them to gradient directions.
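Since the rank is the ledger's lone free parameter, a small sensitivity sweep is one way to check that the proxy-versus-random gap is not an artifact of a single rank choice; the arrays below are the same kind of synthetic stand-ins used in the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 256, 4096

# Synthetic activations and partly aligned gradients, as in the first sketch.
acts = rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)
grads = 0.7 * acts + 0.3 * rng.normal(size=(n, d))

def captured(G, B):
    return np.sum((G @ B) ** 2) / np.sum(G ** 2)

_, _, Vt = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
for k in (4, 8, 16, 32, 64):
    proxy = Vt[:k].T
    control = np.linalg.qr(rng.normal(size=(d, k)))[0]
    print(f"k={k:3d}  proxy={captured(grads, proxy):.3f}  "
          f"control={captured(grads, control):.3f}  chance={k / d:.3f}")
```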

pith-pipeline@v0.9.0 · 5453 in / 1231 out tokens · 65562 ms · 2026-05-10T17:02:26.407170+00:00 · methodology

discussion (0)

