Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
Pith reviewed 2026-05-25 04:11 UTC · model grok-4.3
The pith
The hierarchical geometry of concepts in language model embeddings arises from the spectral properties of word co-occurrence statistics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper assumes that words closer on the WordNet hypernym graph co-occur more often. Under mild positivity and decay conditions on the co-occurrence kernel, it proves that the leading eigenvectors of the embedding Gram matrix first separate broad taxonomic branches and then progressively finer sub-branches. The result is a hierarchical splitting geometry whose coarse-to-fine spectral organization mirrors the tree.
What carries the argument
The spectrum of the embedding Gram matrix induced by the co-occurrence kernel, whose leading eigenvectors isolate first the broader and then the narrower branches of the hypernym tree.
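To make the mechanism concrete, here is a minimal Python sketch. It is not the paper's construction. It assumes a hypothetical toy kernel on the eight leaves of a balanced binary tree, `K[i, j] = decay ** d(i, j)`, where `d` is the tree path length between leaves. The function names and the values `decay = 0.5` and `depth = 3` are illustrative choices made here, not taken from the paper.

```python
import numpy as np

def leaf_tree_distance(i: int, j: int) -> int:
    """Path length between leaves i and j of a balanced binary tree.

    Leaves are indexed 0..2**depth - 1 from left to right, so the highest
    differing bit of i and j gives the height of their lowest common ancestor.
    """
    if i == j:
        return 0
    lca_height = (i ^ j).bit_length()
    return 2 * lca_height  # up to the common ancestor and back down

def toy_kernel(depth: int, decay: float = 0.5) -> np.ndarray:
    """Hypothetical kernel: positive everywhere, decaying with tree distance."""
    n = 2 ** depth
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = decay ** leaf_tree_distance(i, j)
    return K

depth = 3
K = toy_kernel(depth)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

top = eigvecs[:, 0]      # same sign on every leaf (no split)
second = eigvecs[:, 1]   # sign flips between the two top-level branches
branch_signs = np.sign(second).astype(int)

print(np.round(eigvals, 4))
print(branch_signs)
```

In this toy case the first eigenvector is constant, the second changes sign exactly between leaves 0-3 and leaves 4-7, and the remaining eigenvalues belong to progressively finer splits. Real co-occurrence kernels are noisier, so this illustrates only the coarse-to-fine ordering, not the empirical result.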
If this is right
- The same coarse-to-fine splitting signature appears in both static word2vec embeddings and Gemma 2B unembeddings.
- Hierarchical concept geometry in LLMs can emerge without any hierarchy-specific functional mechanism.
- The organization is fully determined by the spectral properties of pairwise word statistics.
Where Pith is reading between the lines
- The result suggests that analogous hierarchical structure could appear in any embedding space whose Gram matrix satisfies similar positivity and decay conditions on its kernel.
- It may explain why taxonomic relations surface in embeddings trained only on next-token prediction without explicit tree supervision.
- The same mechanism could be tested on correlation matrices from other modalities or languages to check whether the coarse-to-fine pattern is generic.
Load-bearing premise
Words closer on the WordNet hypernym graph co-occur more often.
What would settle it
A WordNet subtree in which the leading eigenvectors of the co-occurrence Gram matrix fail to isolate broad branches before finer ones would falsify the claimed spectral organization.
Original abstract
We propose a distributional theory of how hypernymy, the "is-a" relation between general and specific concepts, is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a distributional theory of hypernymy in language representations. It starts from the assumption that words closer on the WordNet hypernym graph co-occur more frequently, then proves that under positivity and decay conditions on the resulting co-occurrence kernel the leading eigenvectors of the word2vec Gram matrix exhibit hierarchical splitting: broad taxonomic branches separate first, followed by progressively finer sub-branches, yielding a coarse-to-fine spectral organization that mirrors the tree. The same signature is reported to hold in Gemma 2B unembeddings. The central claim is that this geometry emerges from pairwise statistics rather than hierarchy-specific mechanisms.
Significance. If the derivation is correct and the kernel assumption holds with the stated conditions, the result supplies a parameter-free spectral explanation for hierarchical geometry observed across embedding methods. It links standard co-occurrence statistics directly to tree-like organization in both static and contextual models, offering a unified account that could be tested on additional corpora and architectures.
Major comments (2)
- [Abstract and theoretical derivation] The proof (theoretical section following the assumption statement) shows that a kernel with positivity and decay yields the claimed eigenvector splitting, but the mapping from WordNet distance to the kernel is justified solely by the modeling choice that co-occurrence decreases with hypernym distance. No quantitative verification (e.g., measured decay rates or correlation values between pairwise co-occurrence counts and shortest-path distances on the sampled subtrees) is supplied to confirm that the positivity/decay conditions are met in the data used for the word2vec experiments; this assumption is load-bearing for transferring the theorem to real embeddings.
- [Empirical validation] The empirical confirmation across WordNet subtrees (experimental section) reports that the predicted splitting signature appears in word2vec and extends to Gemma, yet the manuscript does not include controls that would isolate the contribution of the WordNet-derived kernel from other factors (topical or frequency-based associations) that also shape co-occurrence; without such controls the experiments cannot rule out that the observed geometry arises for reasons orthogonal to the tree structure assumed in the proof.
Minor comments (2)
- [Theoretical setup] Notation for the co-occurrence kernel and the Gram matrix should be introduced with explicit definitions before the spectral analysis begins to improve readability.
- [Introduction] The abstract states the assumption is 'empirically verified' but the main text would benefit from a short dedicated paragraph or table summarizing the verification statistics.
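The verification statistics the referee asks for could be as simple as a rank correlation between pairwise co-occurrence counts and hypernym-graph distances. The sketch below shows one way to compute a Spearman correlation with tie handling in pure Python. The two toy lists are hypothetical numbers invented for illustration, not data from the paper, and the helper names are assumptions of this sketch.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical word pairs: shortest-path distance on the hypernym graph
# and the corresponding co-occurrence count in some corpus.
path_distance = [1, 1, 2, 2, 3, 4]
cooccurrence = [120, 95, 60, 44, 12, 3]

rho = spearman(path_distance, cooccurrence)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative value would be consistent with the load-bearing assumption that closer words co-occur more often. It would not by itself establish the positivity and decay conditions required by the theorem.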
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and outline revisions that will strengthen the connection between the theoretical assumptions and the empirical results.
- Referee: [Abstract and theoretical derivation] The proof (theoretical section following the assumption statement) shows that a kernel with positivity and decay yields the claimed eigenvector splitting, but the mapping from WordNet distance to the kernel is justified solely by the modeling choice that co-occurrence decreases with hypernym distance. No quantitative verification (e.g., measured decay rates or correlation values between pairwise co-occurrence counts and shortest-path distances on the sampled subtrees) is supplied to confirm that the positivity/decay conditions are met in the data used for the word2vec experiments; this assumption is load-bearing for transferring the theorem to real embeddings.
  Authors: We agree that explicit quantitative checks on the kernel conditions are necessary to make the transfer from theorem to data fully rigorous. In the revised manuscript we will add a dedicated subsection that reports (i) Pearson and Spearman correlations between empirical co-occurrence counts and WordNet shortest-path distances across the sampled subtrees, (ii) verification that the resulting kernel is positive, and (iii) confirmation of the required decay behavior. These statistics will be computed on the same corpora used for the word2vec training runs.
  Revision: yes
- Referee: [Empirical validation] The empirical confirmation across WordNet subtrees (experimental section) reports that the predicted splitting signature appears in word2vec and extends to Gemma, yet the manuscript does not include controls that would isolate the contribution of the WordNet-derived kernel from other factors (topical or frequency-based associations) that also shape co-occurrence; without such controls the experiments cannot rule out that the observed geometry arises for reasons orthogonal to the tree structure assumed in the proof.
  Authors: We accept that additional controls are required to demonstrate specificity to the WordNet-induced co-occurrence structure. In revision we will include two control experiments: (1) word2vec embeddings trained on a version of the corpus in which co-occurrence statistics have been randomized while preserving marginal frequencies, and (2) embeddings derived from synthetic hierarchies whose distance kernels do not match WordNet. We will show that the hierarchical splitting signature is markedly weaker or absent under these controls, thereby isolating the contribution of the tree-structured kernel.
  Revision: yes
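One simple reading of control (1) is a token shuffle: permute all tokens across the corpus while keeping sentence lengths fixed, so every word keeps its unigram count but loses its original neighbors. The sketch below illustrates that idea only. The rebuttal does not specify the randomization procedure, so the function `shuffle_corpus`, the seed, and the three-sentence toy corpus are all assumptions of this example.

```python
import random
from collections import Counter

def shuffle_corpus(sentences, seed=0):
    """Permute tokens across the whole corpus, keeping sentence lengths.

    Unigram (marginal) frequencies are preserved exactly; which words
    appear near each other is randomized.
    """
    rng = random.Random(seed)
    tokens = [tok for sent in sentences for tok in sent]
    rng.shuffle(tokens)
    shuffled, start = [], 0
    for sent in sentences:
        shuffled.append(tokens[start:start + len(sent)])
        start += len(sent)
    return shuffled

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["dog", "barks"],
    ["cat", "purrs", "softly"],
    ["dog", "chases", "cat"],
]
control = shuffle_corpus(corpus)

def unigram_counts(sentences):
    return Counter(tok for sent in sentences for tok in sent)

print(unigram_counts(control) == unigram_counts(corpus))
```

Embeddings trained on `control` would then serve as the baseline: if the splitting signature survives the shuffle, it cannot be attributed to tree-structured co-occurrence.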
Circularity Check
No circularity: derivation proceeds from external empirical assumption via spectral analysis
The paper begins with the stated modeling assumption that co-occurrence probability decreases with WordNet hypernym distance, then imposes positivity and decay conditions on the resulting kernel and applies spectral analysis to prove the coarse-to-fine eigenvector splitting. This chain is a direct mathematical consequence of the kernel properties and does not reduce any claimed prediction or theorem to a fitted parameter, self-citation, or quantity defined in terms of the output. Verification on word2vec and Gemma embeddings is presented as external confirmation rather than part of the derivation itself. No load-bearing step matches any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption: Words closer on the WordNet hypernym graph co-occur more often.
- standard math: Mild positivity and decay conditions on the co-occurrence kernel.