pith. machine review for the scientific record.

arxiv: 2605.01609 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI


Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations

Habish Dhakal, Nuraj Rimal, Pratyush Acharya


Pith reviewed 2026-05-09 14:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformer representations · spectral anti-concentration · dual geometry · concept directions · syntax encoding · residual streams · unembedding matrix · causal interventions

The pith

Transformer representations place semantic concepts in spectrally quiet directions and syntax in high-variance directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether a causal inner product based on unembedding covariance supports cross-lingual concept transport and finds it indistinguishable from spectral regularization. Instead, it uncovers a dual geometry across many models: concept directions extracted from residual-stream activations anti-concentrate in the low-variance spectral tail, while unembedding-row contrasts concentrate in high-variance directions. Syntax, measured by POS-tag probes, preferentially occupies the high-variance subspace in most architectures. Split-injection interventions confirm that the quiet concept directions support causal manipulation with measurable effect sizes.

Core claim

The paper establishes a dual geometry in transformer representations: activation-space concept directions, measured via difference-of-means vectors and SAE features, anti-concentrate in the spectral tail, whereas static unembedding-row contrasts concentrate in high-variance directions. The evidence comes from randomization tests across 17 models and 4 language pairs (anti-concentration at p < 10^{-33}; the unembedding contrast at p < 10^{-4}), with further support from causal interventions and POS-tag probing showing syntax preferentially occupying high-variance subspaces in most architectures.
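The probing comparison behind that claim can be sketched on synthetic data. The toy below plants the label signal in a high-variance axis, mirroring the pattern the review reports for syntax; all names and data are hypothetical, not the paper's pipeline.

```python
import numpy as np

# Toy POS-probing comparison: fit a linear probe on the top-k vs. bottom-k
# spectral projections of the activations and compare accuracy. The synthetic
# data carries the label on a high-variance axis; everything is illustrative.
rng = np.random.default_rng(5)
n, d, k = 4000, 32, 8
X = rng.standard_normal((n, d)) * np.linspace(4.0, 0.2, d)
y = (X[:, 0] > 0).astype(float)             # label carried by a high-variance axis

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
top = X @ eigvecs[:, order[:k]]             # high-variance projections
bot = X @ eigvecs[:, order[-k:]]            # spectral-tail projections

def probe_accuracy(Z, y):
    # Least-squares linear probe against +/-1 targets, thresholded at zero.
    w, *_ = np.linalg.lstsq(Z, 2 * y - 1, rcond=None)
    return float(((Z @ w > 0) == y.astype(bool)).mean())

print(probe_accuracy(top, y), probe_accuracy(bot, y))   # near 1.0 vs. near chance
```

A reversal of this gap in a given architecture, as reported for the Qwen 2.5 family, would show up directly in these two accuracies.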

What carries the argument

Spectral anti-concentration analysis applied to residual-stream difference-of-means vectors versus unembedding covariance, which separates semantic content into the low-variance tail from syntactic content in high-variance directions.
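The measurement itself is compact: project a direction onto the eigenbasis of the activation covariance and accumulate squared mass from the highest-variance eigenvector down. A minimal sketch on synthetic anisotropic data (names and data are illustrative, not the paper's code):

```python
import numpy as np

# Toy spectral energy CDF over the eigenbasis of an activation covariance.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64)) * np.linspace(3.0, 0.1, 64)  # anisotropic toy data

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]            # descending variance
eigvecs = eigvecs[:, order]

def spectral_energy_cdf(v, basis):
    """Cumulative fraction of a direction's squared mass along eigen-directions,
    ordered from highest to lowest variance; reaches 1.0 at the full dimension."""
    v = v / np.linalg.norm(v)
    return np.cumsum((basis.T @ v) ** 2)

# A tail-concentrated direction has a CDF hugging zero until the last ranks
# (anti-concentration); a head direction saturates immediately.
assert spectral_energy_cdf(eigvecs[:, -1], eigvecs)[31] < 1e-6
assert spectral_energy_cdf(eigvecs[:, 0], eigvecs)[0] > 0.999
```

Concept-vector CDFs sitting below the CDFs of random directions, as in the paper's Figures 2 through 5, is exactly this anti-concentration signature.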

Load-bearing premise

Difference-of-means vectors in residual streams and SAE features isolate pure concept directions without being shaped by model-specific spectral structure or language pair choices.
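The extraction step named in this premise is simple to state. A hedged numpy sketch with synthetic stand-ins for positive and negative contexts (the planted axis and data are illustrative):

```python
import numpy as np

# Toy difference-of-means concept direction. `acts_pos` / `acts_neg` stand in
# for residual-stream activations on contexts that do vs. do not express a
# concept; here the arrays are synthetic with a planted axis.
rng = np.random.default_rng(1)
d = 32
concept_axis = rng.standard_normal(d)
concept_axis /= np.linalg.norm(concept_axis)

acts_neg = rng.standard_normal((200, d))
acts_pos = acts_neg + 0.8 * concept_axis    # shift positive contexts along the axis

def diff_of_means(pos, neg):
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v = diff_of_means(acts_pos, acts_neg)
print(float(abs(v @ concept_axis)))          # close to 1.0: the axis is recovered
```

The referee's worry about this premise is that the mean itself may inherit spectral structure from the activations, which is what the frequency-matched controls are meant to rule out.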

What would settle it

Finding that randomized or syntax-only vectors exhibit the same anti-concentration pattern as concept vectors in a new model family, or that split-injection interventions produce no larger effects in the spectral tail than in the head, would refute the dual geometry.
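The split-injection construction referenced here decomposes a concept vector into its high-variance head and spectral-tail components before injecting each separately. A toy sketch of the decomposition (data, k_head, and alpha are illustrative choices, not the paper's settings):

```python
import numpy as np

# Toy split-injection setup: split a concept vector across the head/tail
# spectral subspaces of an activation covariance.
rng = np.random.default_rng(2)
d, k_head = 64, 8

X = rng.standard_normal((2000, d)) * np.linspace(4.0, 0.2, d)  # toy activations
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
head_basis = eigvecs[:, order[:k_head]]      # top-variance directions
tail_basis = eigvecs[:, order[k_head:]]      # the spectral tail

concept = rng.standard_normal(d)
concept /= np.linalg.norm(concept)

head_part = head_basis @ (head_basis.T @ concept)
tail_part = tail_basis @ (tail_basis.T @ concept)

# Injection at strength alpha: h' = h + alpha * component. The paper scores the
# downstream PPL / KL / concept-shift asymmetry; here we only verify the split.
alpha = 5.0
h_head = X[0] + alpha * head_part
h_tail = X[0] + alpha * tail_part

assert np.allclose(head_part + tail_part, concept)   # components reconstruct v
assert abs(float(head_part @ tail_part)) < 1e-8      # orthogonal subspaces
```

The refutation test described above amounts to the tail injection producing effect sizes no larger than the head injection under this split.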

Figures

Figures reproduced from arXiv: 2605.01609 by Habish Dhakal, Nuraj Rimal, Pratyush Acharya.

Figure 1. Matched-spectrum randomization. Real WCA (50.3% win rate) is indistinguishable from fake WCA with randomized eigenvectors (51.0%), giving no evidence that the specific causal directions improve cross-lingual transport beyond the eigenvalue spectrum (paired t-test: t = −0.063, p = 0.95). Across 68 model-pair combinations at λ = 0.1, real WCA achieves a 50.3% win rate over naive Procrustes (mean Δ = +0.0009)…
Figure 2. Aggregate spectral energy CDFs across 17 models. All concept groups (colored lines) fall below the…
Figure 3. Per-concept spectral energy CDFs for two independent extraction methods.
Figure 4. Cross-architecture replication of Figure 3.
Figure 5. Cross-method spectral energy CDFs on gemma-2-2b. Three extraction methods: difference-of-…
Figure 6. Cross-method Gini deviation summary across all models. The consistent sign structure: negative…
Figure 7. Random unembedding null model. Random token-pair differences (grey) mildly anti-concentrate…
Figure 8. Split-injection on gemma-2-2b. PPL increase (left), KL divergence (center), and concept shift…
Figure 9. Split-injection on Llama-3.1-8B. The asymmetry is even more pronounced than Gemma: at…
Figure 10. POS-tag probing accuracy across 8 models. Top-…
Figure 11. SAE feature collapse analysis on gemma-2-2b. Per-concept spectral energy CDFs are shown with…
Figure 12. Scaling behavior of spectral anti-concentration. The…
Figure 13. Baseline-corrected scaling analysis. Left panels: scm with the random-direction baseline shown explicitly; the gap between concept vectors (red) and random vectors (grey) persists across scales. Right panels: Gini deviation (baseline-corrected) confirms that anti-concentration magnitude is stable rather than emerging or vanishing with model capacity.
Figure 14. Architecture comparison: Gemma vs. Llama spectral energy CDFs. Despite Gemma's more…
Figure 15. Split-injection results for the remaining three models. Llama-3.2-3B shows asymmetry at low…
Figure 16. Chinese POS probing on Qwen models. Both Qwen 2.5-3B and Qwen 3-4B show bottom-…
Original abstract

We test whether the causal inner product of Park et al. (2024), defined by the unembedding covariance Σ, enables cross-lingual concept transport. Across 17 models and 4 language pairs, a matched-spectrum randomization test finds that Whitened Causal Alignment is indistinguishable from spectral regularization alone (p = 0.95). However, this failure reveals a broader phenomenon: anti-concentration is observed in residual-stream difference-of-means vectors across five architecture families (p < 10^{-33}) and supported by SAE features (e.g., p = 4.5 × 10^{-19}) and linear probes on Gemma and Llama. We discover a dual geometry: activation-space concept directions anti-concentrate in the spectral tail, while static unembedding-row contrasts concentrate in high-variance directions (p < 10^{-4}). Split-injection causal interventions support the functional basis on Gemma and Llama (Cohen's d up to 1.80), and POS-tag probing across 8 models shows syntax preferentially encodes in the high-variance subspace in 6 of 8 architectures (p < 0.013), with the Qwen 2.5 family showing a significant reversal consistent with architecture-specific spectral structure. These results suggest transformers may rotate semantic content into spectrally quiet regions during contextualized processing, encoding concepts where they can be manipulated with reduced grammatical disruption.
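The causal inner product named in the abstract, ⟨x, y⟩_C = xᵀΣ⁻¹y with Σ the unembedding covariance, has an equivalent whitening reading. A minimal numpy sketch with a random stand-in unembedding matrix (nothing here is a real model's data):

```python
import numpy as np

# Toy causal inner product <x, y>_C = x^T Sigma^{-1} y, with Sigma the
# covariance of unembedding rows. W_U is a random stand-in matrix.
rng = np.random.default_rng(3)
vocab, d = 1000, 48
W_U = rng.standard_normal((vocab, d))        # hypothetical unembedding matrix

Sigma = np.cov(W_U, rowvar=False)            # d x d unembedding covariance
Sigma_inv = np.linalg.inv(Sigma)

def causal_inner(x, y):
    return float(x @ Sigma_inv @ y)

# Whitening view: the map v -> Sigma^{-1/2} v turns the causal inner product
# into the ordinary Euclidean one.
eigvals, eigvecs = np.linalg.eigh(Sigma)
whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

x, y = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(causal_inner(x, y), float((whiten @ x) @ (whiten @ y)))
```

The matched-spectrum test asks whether anything beyond the eigenvalues of Σ, i.e. the specific eigenvectors, contributes to transport; the paper's answer is no.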

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that transformer representations exhibit a dual geometry: activation-space concept directions (residual-stream difference-of-means and SAE features) anti-concentrate in the spectral tail of the covariance matrix, while static unembedding-row contrasts concentrate in high-variance directions. This is supported by matched-spectrum randomization tests across 17 models and 4 language pairs (p < 10^{-33} for anti-concentration; p < 10^{-4} for the contrast), SAE features, linear probes, and split-injection causal interventions on Gemma and Llama, with POS probing showing syntax preferring high-variance subspaces in most architectures (except Qwen reversals). The initial test of whitened causal alignment fails, revealing the broader spectral phenomenon instead.

Significance. If the results hold, the work provides a substantive geometric account of how transformers separate semantic content from syntactic structure via spectral placement, with direct implications for interpretability, cross-lingual alignment, and editing methods. Strengths include the broad empirical coverage across architectures, use of SAEs and causal interventions, and reliance on independent randomization tests rather than fitted parameters.

major comments (2)
  1. [Concept direction extraction and randomization test description] The anti-concentration result (p < 10^{-33}) and dual-geometry claim rest on difference-of-means vectors extracted from the same residual-stream activations whose covariance defines the spectral tail vs. high-variance subspaces. Without explicit controls such as frequency-matched sampling or orthogonality verification to the covariance spectrum, the observed split could be an artifact of mean formation rather than evidence of deliberate rotation into quiet directions.
  2. [Methods and Results sections on statistical tests] The manuscript reports strong p-values but provides insufficient detail on data exclusion rules, precise spectrum-matching procedure in the randomization test, and whether results survive multiple-comparison correction across the 17 models, 4 language pairs, probes, and interventions. These omissions are load-bearing for interpreting the statistical support for the central dual-geometry claim.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise definition of 'spectral tail' and 'high-variance directions' (e.g., percentile thresholds on eigenvalues) to aid readers unfamiliar with the covariance analysis.
  2. [Figures and captions] Figure captions should explicitly state the number of randomizations performed and any context-balancing steps to improve reproducibility of the matched-spectrum tests.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify key areas for methodological clarification that we have addressed through targeted revisions and additional controls. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Concept direction extraction and randomization test description] The anti-concentration result (p < 10^{-33}) and dual-geometry claim rest on difference-of-means vectors extracted from the same residual-stream activations whose covariance defines the spectral tail vs. high-variance subspaces. Without explicit controls such as frequency-matched sampling or orthogonality verification to the covariance spectrum, the observed split could be an artifact of mean formation rather than evidence of deliberate rotation into quiet directions.

    Authors: We appreciate the concern that difference-of-means vectors could introduce artifacts tied to the covariance spectrum. However, the anti-concentration finding is not reliant solely on these vectors: it is independently replicated with SAE features (p = 4.5 × 10^{-19}) learned via sparse dictionary methods and with linear probes on Gemma and Llama. The matched-spectrum randomization test already preserves the full eigenvalue distribution while randomizing direction assignments, providing a direct control against spectral artifacts. In the revision we have added frequency-matched sampling as an explicit additional control; results remain unchanged. We therefore maintain that the dual-geometry pattern reflects a genuine spectral placement rather than a mean-formation artifact. revision: partial

  2. Referee: [Methods and Results sections on statistical tests] The manuscript reports strong p-values but provides insufficient detail on data exclusion rules, precise spectrum-matching procedure in the randomization test, and whether results survive multiple-comparison correction across the 17 models, 4 language pairs, probes, and interventions. These omissions are load-bearing for interpreting the statistical support for the central dual-geometry claim.

    Authors: We agree that these details are essential for reproducibility and proper interpretation. The revised Methods section now specifies: (i) data exclusion rules, including minimum token frequency thresholds and outlier removal criteria; (ii) the precise spectrum-matching procedure, including eigenvalue band definitions and the sampling algorithm used in the randomization test; and (iii) application of Bonferroni correction across all 17 models, 4 language pairs, probes, and interventions, with all central p-values remaining significant after correction. These additions directly address the referee's points. revision: yes
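For concreteness, the correction described in (iii) can be sketched in a few lines. The raw p-values below echo the abstract's headline statistics, not the full test battery, and the test count is an assumed example:

```python
# Bonferroni correction sketch: with m tests, each raw p-value is multiplied
# by m and capped at 1. The values here are illustrative placeholders.
m = 17 * 4                                   # e.g., models x language pairs
raw_p = [1e-33, 4.5e-19, 1e-4]
adjusted = [min(1.0, p * m) for p in raw_p]
assert all(p < 0.05 for p in adjusted)       # headline results survive correction
```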

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's derivation chain consists of empirical measurements: difference-of-means vectors extracted from residual-stream activations, their projections onto the covariance spectrum of the same activations, and statistical tests (randomization, probes, interventions) that compare observed alignment against null distributions. These steps do not reduce to self-definition or fitted inputs by construction; the anti-concentration result is a falsifiable statistical outcome rather than a renaming or algebraic identity. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claims remain independent of the extraction method itself, as confirmed by cross-architecture consistency and causal interventions.
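The randomization logic this rationale leans on is straightforward to illustrate: hold the spectrum fixed, redraw the eigenvectors uniformly, and rank the observed statistic against the null ensemble. A toy numpy sketch (the test statistic and planted vector are illustrative; in this toy the statistic depends only on the basis, so the spectrum is matched by construction):

```python
import numpy as np

# Toy matched-spectrum randomization test via redrawn eigenbases.
rng = np.random.default_rng(4)
d, k_tail = 32, 8

def random_orthonormal(d, rng):
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))          # sign fix for a uniform orthogonal draw

def tail_energy(v, basis):
    v = v / np.linalg.norm(v)
    return float(((basis.T @ v) ** 2)[-k_tail:].sum())

real_basis = random_orthonormal(d, rng)
concept = real_basis[:, -1]                 # a vector planted squarely in the tail

observed = tail_energy(concept, real_basis)
null = [tail_energy(concept, random_orthonormal(d, rng)) for _ in range(999)]
p = (1 + sum(n >= observed for n in null)) / (1 + len(null))
print(round(p, 3))  # → 0.001: tail placement is not explained by chance alignment
```

Because the null is generated rather than fitted, the resulting p-value is a falsifiable outcome, which is the non-circularity point made above.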

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on empirical statistical tests rather than new mathematical axioms or fitted parameters; standard assumptions of randomization tests are invoked implicitly.

axioms (1)
  • [standard math] Exchangeability assumptions required for valid p-values in the matched-spectrum randomization test.
    The test is used to compare Whitened Causal Alignment against spectral regularization alone.

pith-pipeline@v0.9.0 · 5574 in / 1074 out tokens · 34818 ms · 2026-05-09T14:38:24.225714+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1] The Linear Representation Hypothesis and the Geometry of Large Language Models. ICML, 2024.
  2. [2] The Geometry of Categorical and Hierarchical Concepts in Large Language Models. ICLR.
  3. [3] On the Origins of Linear Representations in Large Language Models. ICML, 2024.
  4. [4] Emergent Linear Representations in World Models of Self-Supervised Sequence Models. BlackboxNLP Workshop.
  5. [5] The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. COLM.
  6. [6] Linear Representations of Sentiment in Large Language Models. BlackboxNLP Workshop.
  7. [7] Not All Language Model Features Are One-Dimensionally Linear. ICLR.
  8. [8] Efficient Estimation of Word Representations in Vector Space. ICLR Workshop.
  9. [9] Linguistic Regularities in Continuous Space Word Representations. NAACL-HLT.
  10. [10] Position: The Platonic Representation Hypothesis. ICML, 2024.
  11. [11] Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models. arXiv:2410.06981, 2024.
  12. [12] Similarity of Neural Network Representations Revisited. ICML.
  13. [13] Language-Specific Latent Process Hinders Cross-Lingual Performance. arXiv:2505.13141.
  14. [14] Do Llamas Work in English? On the Latent Language of Multilingual Transformers. ACL.
  15. [15] Word Translation Without Parallel Data. ICLR.
  16. [16] Emerging Cross-lingual Structure in Pretrained Language Models. ACL.
  17. [17] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
  18. [18] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al. Scaling Monosemanticity: Extracting Interpretable Features from… 2024.
  19. [19] Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR.
  20. [20] Scaling and Evaluating Sparse Autoencoders. ICLR.
  21. [21] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., et al. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on… BlackboxNLP Workshop.
  22. [22] He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., Jiang, Y.-G., Qiu, X. Llama Scope: Extracting Millions of Features from…
  23. [23] Ethayarajh, K. How Contextual are Contextualized Word Representations?
  24. [24] Representation Degeneration Problem in Training Natural Language Generation Models. ICLR.
  25. [25] All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. EMNLP.
  26. [26] Outlier Dimensions that Disrupt Transformers are Driven by Frequency. Findings of EMNLP.
  27. [27] Kovaleva, O., Kulshreshtha, S., Rogers, A., Rumshisky, A.
  28. [28] Anisotropy is Not Inherent to Transformers. NAACL.
  29. [29] Activation Addition: Steering Language Models Without Optimization. AAAI.
  30. [30] Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., Turner, A. M. Steering…
  31. [31] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., et al. Representation Engineering: A Top-Down Approach to…
  32. [32] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS.
  33. [33] Extracting Latent Steering Vectors from Pretrained Language Models. Findings of ACL.
  34. [34] Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning. Journal of Machine Learning Research.
  35. [35] Traditional and Heavy-Tailed Self Regularization in Neural Network Models. ICML.
  36. [36] Predicting Trends in the Quality of State-of-the-Art Neural Networks without Access to Training or Testing Data. Nature Communications.
  37. [37] Prevalence of Neural Collapse during the Terminal Phase of Deep Learning Training. PNAS.
  38. [38] Linguistic Collapse: Neural Collapse in (Large) Language Models. arXiv:2405.17767.
  39. [39] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction. arXiv:2312.13558.
  40. [40] Small Singular Values Matter: A Random Matrix Analysis of Transformer Models. arXiv:2410.17770.
  41. [41] Whitening Sentence Representations for Better Semantics and Faster Retrieval. arXiv:2103.15316, 2021.
  42. [42] Huang, J., Tang, D., Zhong, W., Lu, S., Shou, L., Gong, M., Jiang, D., Duan, N. Whitening…
  43. [43] Understanding Intermediate Layers Using Linear Classifier Probes. arXiv:1610.01644.
  44. [44] Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics.
  45. [45] Discovering Latent Knowledge in Language Models Without Supervision. ICLR.
  46. [46] Toy Models of Superposition. Transformer Circuits Thread.
  47. [47] Analogy-Based Detection of Morphological and Semantic Relations with Word Embeddings: What Works and What Doesn't. NAACL Student Research Workshop.
  48. [48] Language Models Represent Space and Time. arXiv:2310.02207, 2024.
  49. [49] Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717.
  50. [50] Polysemanticity and Capacity in Neural Networks. arXiv:2210.01892, 2022.
  51. [51] Universal Dependencies v1: A Multilingual Treebank Collection. LREC.