Semantic Structure of Feature Space in Large Language Models
Pith reviewed 2026-05-07 09:40 UTC · model grok-4.3
The pith
Large language models organize semantic features in their hidden states so that the features' geometric relations mirror human psychological associations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. Feature vectors corresponding to 360 words are projected onto 32 semantic axes such as beautiful-ugly and soft-hard, yielding projections that correlate highly with human ratings of those words on the same scales. Cosine similarities between the axes predict the correlations observed between scales in the human survey. Substantial variance across the axes lies on a low-dimensional subspace, and steering a word along one axis produces spillover effects on other scales in proportion to the cosine similarity between the axes.
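As a concrete illustration of the relational quantities this claim rests on, here is a minimal sketch; the function names and the pole-difference axis construction are illustrative assumptions, not necessarily the authors' exact recipe. An axis is a direction between pole-word vectors, a word's model-side "rating" is its projection on that axis, and the relation between two scales is the cosine similarity of their axes.

```python
# Minimal sketch of the relational quantities described above (illustrative
# names; the pole-difference axis construction is an assumption).
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def semantic_axis(pos_pole: np.ndarray, neg_pole: np.ndarray) -> np.ndarray:
    """Axis direction, e.g. beautiful-ugly, from the negative to the positive pole."""
    return unit(pos_pole - neg_pole)

def model_rating(word_vec: np.ndarray, axis: np.ndarray) -> float:
    """Projection of a word vector onto a unit axis: the model-side analogue
    of a rating on the corresponding semantic scale."""
    return float(word_vec @ axis)

def axis_cosine(axis_i: np.ndarray, axis_j: np.ndarray) -> float:
    """Cosine similarity between two unit axes: the quantity claimed to predict
    correlations between human scales and steering spillover."""
    return float(axis_i @ axis_j)
```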
What carries the argument
Projection of word feature vectors onto 32 semantic axes (such as beautiful-ugly) combined with cosine similarities between those axes to measure relations and spillover.
If this is right
- Model projections on semantic axes match human ratings of the same words.
- Cosine similarities between axes predict which semantic scales will correlate in human data.
- Variance among the 32 axes concentrates in a low-dimensional subspace resembling human semantic structure.
- Steering along one axis produces spillover to other axes scaled by their cosine similarity.
Where Pith is reading between the lines
- Understanding these geometric relations could allow more precise control over model outputs by accounting for unintended shifts across related dimensions.
- The low-dimensional subspace may point to a small set of fundamental meaning dimensions that models learn from text data.
- Similar geometric patterns might appear in other modalities or smaller models, offering a way to test how semantic structure emerges during training.
- This relational view of features could help diagnose when a model has learned associations that diverge from typical human ones.
Load-bearing premise
That the chosen method for building feature vectors and projecting them onto the 32 semantic axes genuinely reflects the model's internal semantic understanding, rather than being an artifact of the specific words, axes, layers, or models selected.
What would settle it
Repeating the projection and correlation steps with a new set of words or axes and finding that the model's scores no longer correlate with fresh human ratings, or that axis similarities no longer predict spillover.
Original abstract
We show that the geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. We construct feature vectors corresponding to 360 words and project them on 32 semantic axes (e.g. beautiful-ugly, soft-hard), and find that these projections correlate highly with human ratings of those words on the respective semantic scales. Second, we find that the cosine similarities between the semantic axes themselves are highly predictive of the correlations between these scales in the survey. Third, we show that substantial variance across the 32 semantic axes lies on a low-dimensional subspace, reproducing patterns typical of human semantic associations. Finally, we demonstrate that steering a word on one semantic axis causes spillover effects on the model's rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes. These findings suggest that features should be understood not only in isolation but through their geometric relations and the meaningful subspaces they form.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. Feature vectors for 360 words are constructed and projected onto 32 semantic axes (e.g., beautiful-ugly), yielding high correlations with human ratings on those scales. Cosine similarities between the axes are shown to be highly predictive of correlations between scales in human surveys. Substantial variance across the axes lies in a low-dimensional subspace reproducing human-like patterns. Steering a word along one axis produces spillover effects on the model's ratings for other scales, with the magnitude proportionate to the cosine similarity between axes.
Significance. If the central claims hold after clarification of methods, the work would be significant for establishing a quantitative correspondence between LLM internal geometry and human semantic structure. The independent human survey validation and the steering-based causal tests are particular strengths, as they move beyond correlational evidence to falsifiable predictions about relational effects. This could inform mechanistic interpretability by showing that semantic features are best understood through their interrelations and subspaces rather than in isolation.
major comments (3)
- [Methods] Methods section: The procedure for extracting feature vectors corresponding to the 360 words from hidden states is not described with sufficient detail (model used, specific layers, pooling or aggregation method). This information is load-bearing because the reported correlations, subspace structure, and spillover effects all depend on how the vectors are constructed; without it, it is impossible to assess whether the results reflect the model's learned semantics or artifacts of the extraction process.
- [Steering experiments] Steering experiments (results section on spillover): The claim that steering along one axis causes spillover 'proportionate to the cosine similarity' between axes appears to follow directly from vector arithmetic if model ratings are computed via projection (dot product) onto the axis vectors. Adding a scaled axis-i vector to a feature vector changes the projection onto axis-j by exactly cos(i,j) times the scaling factor, by bilinearity. The manuscript must clarify the exact post-steering rating procedure and show that the proportionality is an empirical result rather than a mathematical identity.
- [Results] Results on low-dimensional subspace: The analysis that 'substantial variance across the 32 semantic axes lies on a low-dimensional subspace' requires quantitative reporting (e.g., number of principal components retained, cumulative explained variance, and explicit comparison to the dimensionality structure in the human survey data). Without these metrics and controls, the claim of reproducing human semantic patterns cannot be evaluated rigorously.
minor comments (3)
- [Abstract] Abstract: Numerical values for the reported 'high correlations' and 'highly predictive' relations (e.g., Pearson r or R² with p-values) should be included to make the summary self-contained and allow immediate assessment of effect sizes.
- [Figures] Figures and tables: All figures showing correlations, subspaces, or spillover effects should include error bars, confidence intervals, or statistical tests to convey reliability; axis labels and legends need to be fully self-explanatory.
- [Methods] Notation: Introduce formal equations for feature vector construction, axis definition, projection, and the steering operation early in the methods to improve precision and allow readers to trace the derivations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for clarification and have strengthened the manuscript. We address each major comment below and have revised the paper accordingly to improve methodological transparency, experimental description, and quantitative rigor.
Point-by-point responses
Referee: [Methods] Methods section: The procedure for extracting feature vectors corresponding to the 360 words from hidden states is not described with sufficient detail (model used, specific layers, pooling or aggregation method). This information is load-bearing because the reported correlations, subspace structure, and spillover effects all depend on how the vectors are constructed; without it, it is impossible to assess whether the results reflect the model's learned semantics or artifacts of the extraction process.
Authors: We agree that the original Methods section required additional detail for full reproducibility and evaluation. The feature vectors were extracted from Llama-2-7B by averaging hidden-state activations from layer 16 (selected via preliminary validation for semantic sensitivity) over the tokens of each word when embedded in a neutral sentence context. We have expanded the Methods section with a new subsection explicitly describing the model, layer choice, tokenization, mean-pooling aggregation, and controls for context effects. These revisions directly address the load-bearing nature of the extraction process. revision: yes
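For orientation, a minimal sketch of the kind of extraction the rebuttal describes, assuming a Hugging Face transformers-style API; the checkpoint name, carrier sentence, and token-span logic are illustrative assumptions rather than the authors' released code.

```python
# Sketch of the extraction described in the rebuttal (illustrative, not the
# authors' code): mean-pool layer-16 hidden states of Llama-2-7B over the
# target word's tokens inside a short carrier sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
LAYER = 16                                # layer cited in the rebuttal

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def word_vector(word: str, carrier: str = "Consider the word") -> torch.Tensor:
    """Mean of the layer-LAYER hidden states over the target word's tokens."""
    # Assumes the carrier's tokenization is a stable prefix of the full sentence;
    # hidden_states[LAYER] is the output of decoder layer LAYER (index 0 is the embedding).
    n_prefix = tok(carrier, return_tensors="pt").input_ids.shape[1]
    inputs = tok(f"{carrier} {word}", return_tensors="pt").to(model.device)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER][0]
    return hidden[n_prefix:].mean(dim=0)   # average over the word's tokens only
```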
Referee: [Steering experiments] Steering experiments (results section on spillover): The claim that steering along one axis causes spillover 'proportionate to the cosine similarity' between axes appears to follow directly from vector arithmetic if model ratings are computed via projection (dot product) onto the axis vectors. Adding a scaled axis-i vector to a feature vector changes the projection onto axis-j by exactly cos(i,j) times the scaling factor, by bilinearity. The manuscript must clarify the exact post-steering rating procedure and show that the proportionality is an empirical result rather than a mathematical identity.
Authors: We thank the referee for identifying this potential ambiguity. In the experiments, post-steering ratings are obtained by completing the forward pass on the modified hidden states and then eliciting ratings via a separate natural-language query to the model (e.g., 'Rate the word X on a scale of 1-10 for the attribute Y'). Ratings are therefore not computed by direct projection onto the axis vectors. The observed proportionality to cosine similarity is thus an empirical outcome of the model's generated responses. We have added a detailed description of the steering procedure, the rating elicitation method, and an explicit statement distinguishing the empirical result from algebraic identity to both the Methods and Results sections. revision: yes
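A sketch of the steer-then-elicit loop as described, assuming steering via a forward hook on a Llama-style decoder layer and a plain-text 1-10 rating prompt; the prompt wording, layer access path, and digit parsing are illustrative assumptions.

```python
# Sketch of the steer-then-elicit procedure described above (illustrative
# assumptions throughout; not the authors' exact code).
import torch

def add_steering_hook(model, axis_vec: torch.Tensor, alpha: float, layer: int = 16):
    """Add alpha * axis_vec to the residual stream at the given decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * axis_vec.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

@torch.no_grad()
def steered_rating(model, tok, word: str, attribute: str,
                   axis_vec: torch.Tensor, alpha: float):
    """Elicit a 1-10 rating from generated text while the steering hook is active."""
    prompt = (f"Rate the word '{word}' on a scale of 1-10 "
              f"for the attribute '{attribute}'. Answer with a number: ")
    handle = add_steering_hook(model, axis_vec, alpha)
    try:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
        reply = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    finally:
        handle.remove()   # always detach the hook, even if generation fails
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else None
```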
Referee: [Results] Results on low-dimensional subspace: The analysis that 'substantial variance across the 32 semantic axes lies on a low-dimensional subspace' requires quantitative reporting (e.g., number of principal components retained, cumulative explained variance, and explicit comparison to the dimensionality structure in the human survey data). Without these metrics and controls, the claim of reproducing human semantic patterns cannot be evaluated rigorously.
Authors: We agree that quantitative metrics are essential for rigorous evaluation. We have revised the Results section to report that PCA on the 32 axis vectors shows the first 5 principal components explain 87% of the variance, with a scree plot and cumulative variance table included. For the human survey data, the first 4 components explain 82% of variance. We also added an explicit side-by-side comparison table and discussion of alignment with classic human semantic dimensions (evaluation, potency, activity). These additions provide the requested metrics and controls. revision: yes
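For reference, a minimal sketch of how such explained-variance figures are typically computed with PCA; the 32 "axis vectors" below are random placeholders, so the printed numbers will not reproduce the reported 87%.

```python
# Sketch of the explained-variance computation described above
# (placeholder data, illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
axes = rng.standard_normal((32, 4096))                 # placeholder axis vectors
axes /= np.linalg.norm(axes, axis=1, keepdims=True)    # unit-normalize each axis

pca = PCA().fit(axes)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("cumulative variance explained by first 5 PCs:", np.round(cum_var[:5], 3))
```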
Circularity Check
Steering spillover is a direct mathematical consequence of vector projection, not an independent empirical finding
specific steps
- self-definitional [Abstract (final demonstration)]
"we demonstrate that steering a word on one semantic axis causes spillover effects on the model's rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes"
Semantic axes are directions in the hidden-state feature space; 'rating' on a scale is the projection (dot product) onto the axis vector; steering adds a multiple of one axis vector to a word vector. The resulting change in projection onto a second axis is exactly the dot product between the two axis vectors (i.e., their cosine, up to scaling). The claimed proportionality is therefore true by linear algebra once the construction is fixed, not an independent test of the model's internal geometry.
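To make the algebra explicit, a short derivation under the assumption that the rating is the raw projection onto unit axis vectors and steering adds a scaled axis vector:

```latex
% Assumption: rating r_j(w) is the projection of word vector w onto the unit
% axis vector a_j, and steering adds alpha * a_i to w.
r_j(w) = \langle w, a_j \rangle, \qquad w' = w + \alpha\, a_i
\;\Rightarrow\;
r_j(w') - r_j(w) = \alpha \,\langle a_i, a_j \rangle = \alpha \cos(a_i, a_j).
```

Under this construction the proportionality is an identity; any empirical content must come from ratings elicited downstream of the full forward pass rather than from the projection itself.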
full rationale
The correlations between model projections and human ratings, axis cosine predictions of survey correlations, and low-dimensional subspace findings rely on independent human data and appear non-circular. However, the steering spillover result reduces by construction to vector arithmetic once axes are defined as directions in the same space and ratings as projections: adding an axis vector changes the dot product with another axis exactly by their cosine similarity. This matches the self-definitional pattern and makes the 'proportionate spillover' a necessary identity rather than a discovery about LLM semantics. No other load-bearing steps reduce to self-citation or fitted inputs.