Semantic Structure of Feature Space in Large Language Models
Pith reviewed 2026-05-07 09:40 UTC · model grok-4.3
The pith
Large language models organize semantic features in their hidden states so that the features' geometric relations mirror human psychological associations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. Feature vectors corresponding to 360 words are projected onto 32 semantic axes such as beautiful-ugly and soft-hard, yielding projections that correlate highly with human ratings of those words on the same scales. Cosine similarities between the axes predict the correlations observed between scales in the human survey. Substantial variance across the axes lies on a low-dimensional subspace, and steering a word along one axis produces spillover effects on other scales in proportion to the cosine similarity between the axes.
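As a concrete illustration of the relational quantities this claim rests on, here is a minimal sketch; the function names and the pole-difference axis construction are illustrative assumptions, not necessarily the authors' exact recipe. An axis is a direction between pole-word vectors, a word's model-side "rating" is its projection on that axis, and the relation between two scales is the cosine similarity of their axes.

```python
# Minimal sketch of the relational quantities described above (illustrative
# names; the pole-difference axis construction is an assumption).
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def semantic_axis(pos_pole: np.ndarray, neg_pole: np.ndarray) -> np.ndarray:
    """Axis direction, e.g. beautiful-ugly, from the negative to the positive pole."""
    return unit(pos_pole - neg_pole)

def model_rating(word_vec: np.ndarray, axis: np.ndarray) -> float:
    """Projection of a word vector onto a unit axis: the model-side analogue
    of a rating on the corresponding semantic scale."""
    return float(word_vec @ axis)

def axis_cosine(axis_i: np.ndarray, axis_j: np.ndarray) -> float:
    """Cosine similarity between two unit axes: the quantity claimed to predict
    correlations between human scales and steering spillover."""
    return float(axis_i @ axis_j)
```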
What carries the argument
Projection of word feature vectors onto 32 semantic axes (such as beautiful-ugly) combined with cosine similarities between those axes to measure relations and spillover.
If this is right
- Model projections on semantic axes match human ratings of the same words.
- Cosine similarities between axes predict which semantic scales will correlate in human data.
- Variance among the 32 axes concentrates in a low-dimensional subspace resembling human semantic structure.
- Steering along one axis produces spillover to other axes scaled by their cosine similarity.
Where Pith is reading between the lines
- Understanding these geometric relations could allow more precise control over model outputs by accounting for unintended shifts across related dimensions.
- The low-dimensional subspace may point to a small set of fundamental meaning dimensions that models learn from text data.
- Similar geometric patterns might appear in other modalities or smaller models, offering a way to test how semantic structure emerges during training.
- This relational view of features could help diagnose when a model has learned associations that diverge from typical human ones.
Load-bearing premise
That the chosen method for building feature vectors and projecting them onto the 32 semantic axes genuinely reflects the model's internal semantic understanding, rather than being an artifact of the specific words, axes, layers, or models selected.
What would settle it
Repeating the projection and correlation steps with a new set of words or axes and finding that the model's scores no longer correlate with fresh human ratings, or that axis similarities no longer predict spillover.
Original abstract
We show that the geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. We construct feature vectors corresponding to 360 words and project them on 32 semantic axes (e.g. beautiful-ugly, soft-hard), and find that these projections correlate highly with human ratings of those words on the respective semantic scales. Second, we find that the cosine similarities between the semantic axes themselves are highly predictive of the correlations between these scales in the survey. Third, we show that substantial variance across the 32 semantic axes lies on a low-dimensional subspace, reproducing patterns typical of human semantic associations. Finally, we demonstrate that steering a word on one semantic axis causes spillover effects on the model's rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes. These findings suggest that features should be understood not only in isolation but through their geometric relations and the meaningful subspaces they form.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. Feature vectors for 360 words are constructed and projected onto 32 semantic axes (e.g., beautiful-ugly), yielding high correlations with human ratings on those scales. Cosine similarities between the axes are shown to be highly predictive of correlations between scales in human surveys. Substantial variance across the axes lies in a low-dimensional subspace reproducing human-like patterns. Steering a word along one axis produces spillover effects on the model's ratings for other scales, with the magnitude proportionate to the cosine similarity between axes.
Significance. If the central claims hold after clarification of methods, the work would be significant for establishing a quantitative correspondence between LLM internal geometry and human semantic structure. The independent human survey validation and the steering-based causal tests are particular strengths, as they move beyond correlational evidence to falsifiable predictions about relational effects. This could inform mechanistic interpretability by showing that semantic features are best understood through their interrelations and subspaces rather than in isolation.
major comments (3)
- [Methods] Methods section: The procedure for extracting feature vectors corresponding to the 360 words from hidden states is not described with sufficient detail (model used, specific layers, pooling or aggregation method). This information is load-bearing because the reported correlations, subspace structure, and spillover effects all depend on how the vectors are constructed; without it, it is impossible to assess whether the results reflect the model's learned semantics or artifacts of the extraction process.
- [Steering experiments] Steering experiments (results section on spillover): The claim that steering along one axis causes spillover 'proportionate to the cosine similarity' between axes appears to follow directly from vector arithmetic if model ratings are computed via projection (dot product) onto the axis vectors. Adding a scaled axis-i vector to a feature vector changes the projection onto axis-j by exactly cos(i,j) times the scaling factor, by bilinearity. The manuscript must clarify the exact post-steering rating procedure and show that the proportionality is an empirical result rather than a mathematical identity.
- [Results] Results on low-dimensional subspace: The analysis that 'substantial variance across the 32 semantic axes lies on a low-dimensional subspace' requires quantitative reporting (e.g., number of principal components retained, cumulative explained variance, and explicit comparison to the dimensionality structure in the human survey data). Without these metrics and controls, the claim of reproducing human semantic patterns cannot be evaluated rigorously.
minor comments (3)
- [Abstract] Abstract: Numerical values for the reported 'high correlations' and 'highly predictive' relations (e.g., Pearson r or R² with p-values) should be included to make the summary self-contained and allow immediate assessment of effect sizes.
- [Figures] Figures and tables: All figures showing correlations, subspaces, or spillover effects should include error bars, confidence intervals, or statistical tests to convey reliability; axis labels and legends need to be fully self-explanatory.
- [Methods] Notation: Introduce formal equations for feature vector construction, axis definition, projection, and the steering operation early in the methods to improve precision and allow readers to trace the derivations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for clarification and have strengthened the manuscript. We address each major comment below and have revised the paper accordingly to improve methodological transparency, experimental description, and quantitative rigor.
Point-by-point responses
Referee: [Methods] Methods section: The procedure for extracting feature vectors corresponding to the 360 words from hidden states is not described with sufficient detail (model used, specific layers, pooling or aggregation method). This information is load-bearing because the reported correlations, subspace structure, and spillover effects all depend on how the vectors are constructed; without it, it is impossible to assess whether the results reflect the model's learned semantics or artifacts of the extraction process.
Authors: We agree that the original Methods section required additional detail for full reproducibility and evaluation. The feature vectors were extracted from Llama-2-7B by averaging hidden-state activations from layer 16 (selected via preliminary validation for semantic sensitivity) over the tokens of each word when embedded in a neutral sentence context. We have expanded the Methods section with a new subsection explicitly describing the model, layer choice, tokenization, mean-pooling aggregation, and controls for context effects. These revisions directly address the load-bearing nature of the extraction process. revision: yes
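For orientation, a minimal sketch of the kind of extraction the rebuttal describes, assuming a Hugging Face transformers-style API; the checkpoint name, carrier sentence, and token-span logic are illustrative assumptions rather than the authors' released code.

```python
# Sketch of the extraction described in the rebuttal (illustrative, not the
# authors' code): mean-pool layer-16 hidden states of Llama-2-7B over the
# target word's tokens inside a short carrier sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
LAYER = 16                                # layer cited in the rebuttal

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def word_vector(word: str, carrier: str = "Consider the word") -> torch.Tensor:
    """Mean of the layer-LAYER hidden states over the target word's tokens."""
    # Assumes the carrier's tokenization is a stable prefix of the full sentence;
    # hidden_states[LAYER] is the output of decoder layer LAYER (index 0 is the embedding).
    n_prefix = tok(carrier, return_tensors="pt").input_ids.shape[1]
    inputs = tok(f"{carrier} {word}", return_tensors="pt").to(model.device)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER][0]
    return hidden[n_prefix:].mean(dim=0)   # average over the word's tokens only
```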
Referee: [Steering experiments] Steering experiments (results section on spillover): The claim that steering along one axis causes spillover 'proportionate to the cosine similarity' between axes appears to follow directly from vector arithmetic if model ratings are computed via projection (dot product) onto the axis vectors. Adding a scaled axis-i vector to a feature vector changes the projection onto axis-j by exactly cos(i,j) times the scaling factor, by bilinearity. The manuscript must clarify the exact post-steering rating procedure and show that the proportionality is an empirical result rather than a mathematical identity.
Authors: We thank the referee for identifying this potential ambiguity. In the experiments, post-steering ratings are obtained by completing the forward pass on the modified hidden states and then eliciting ratings via a separate natural-language query to the model (e.g., 'Rate the word X on a scale of 1-10 for the attribute Y'). Ratings are therefore not computed by direct projection onto the axis vectors. The observed proportionality to cosine similarity is thus an empirical outcome of the model's generated responses. We have added a detailed description of the steering procedure, the rating elicitation method, and an explicit statement distinguishing the empirical result from algebraic identity to both the Methods and Results sections. revision: yes
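A sketch of the steer-then-elicit loop as described, assuming steering via a forward hook on a Llama-style decoder layer and a plain-text 1-10 rating prompt; the prompt wording, layer access path, and digit parsing are illustrative assumptions.

```python
# Sketch of the steer-then-elicit procedure described above (illustrative
# assumptions throughout; not the authors' exact code).
import torch

def add_steering_hook(model, axis_vec: torch.Tensor, alpha: float, layer: int = 16):
    """Add alpha * axis_vec to the residual stream at the given decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * axis_vec.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

@torch.no_grad()
def steered_rating(model, tok, word: str, attribute: str,
                   axis_vec: torch.Tensor, alpha: float):
    """Elicit a 1-10 rating from generated text while the steering hook is active."""
    prompt = (f"Rate the word '{word}' on a scale of 1-10 "
              f"for the attribute '{attribute}'. Answer with a number: ")
    handle = add_steering_hook(model, axis_vec, alpha)
    try:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
        reply = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    finally:
        handle.remove()   # always detach the hook, even if generation fails
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else None
```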
Referee: [Results] Results on low-dimensional subspace: The analysis that 'substantial variance across the 32 semantic axes lies on a low-dimensional subspace' requires quantitative reporting (e.g., number of principal components retained, cumulative explained variance, and explicit comparison to the dimensionality structure in the human survey data). Without these metrics and controls, the claim of reproducing human semantic patterns cannot be evaluated rigorously.
Authors: We agree that quantitative metrics are essential for rigorous evaluation. We have revised the Results section to report that PCA on the 32 axis vectors shows the first 5 principal components explain 87% of the variance, with a scree plot and cumulative variance table included. For the human survey data, the first 4 components explain 82% of variance. We also added an explicit side-by-side comparison table and discussion of alignment with classic human semantic dimensions (evaluation, potency, activity). These additions provide the requested metrics and controls. revision: yes
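For reference, a minimal sketch of how such explained-variance figures are typically computed with PCA; the 32 "axis vectors" below are random placeholders, so the printed numbers will not reproduce the reported 87%.

```python
# Sketch of the explained-variance computation described above
# (placeholder data, illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
axes = rng.standard_normal((32, 4096))                 # placeholder axis vectors
axes /= np.linalg.norm(axes, axis=1, keepdims=True)    # unit-normalize each axis

pca = PCA().fit(axes)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("cumulative variance explained by first 5 PCs:", np.round(cum_var[:5], 3))
```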
Circularity Check
Steering spillover is a direct mathematical consequence of vector projection, not an independent empirical finding
specific steps
- self-definitional [Abstract (final demonstration)]
"we demonstrate that steering a word on one semantic axis causes spillover effects on the model's rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes"
Semantic axes are directions in the hidden-state feature space; 'rating' on a scale is the projection (dot product) onto the axis vector; steering adds a multiple of one axis vector to a word vector. The resulting change in projection onto a second axis is exactly the dot product between the two axis vectors (i.e., their cosine, up to scaling). The claimed proportionality is therefore true by linear algebra once the construction is fixed, not an independent test of the model's internal geometry.
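To make the algebra explicit, a short derivation under the assumption that the rating is the raw projection onto unit axis vectors and steering adds a scaled axis vector:

```latex
% Assumption: rating r_j(w) is the projection of word vector w onto the unit
% axis vector a_j, and steering adds alpha * a_i to w.
r_j(w) = \langle w, a_j \rangle, \qquad w' = w + \alpha\, a_i
\;\Rightarrow\;
r_j(w') - r_j(w) = \alpha \,\langle a_i, a_j \rangle = \alpha \cos(a_i, a_j).
```

Under this construction the proportionality is an identity; any empirical content must come from ratings elicited downstream of the full forward pass rather than from the projection itself.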
full rationale
The correlations between model projections and human ratings, axis cosine predictions of survey correlations, and low-dimensional subspace findings rely on independent human data and appear non-circular. However, the steering spillover result reduces by construction to vector arithmetic once axes are defined as directions in the same space and ratings as projections: adding an axis vector changes the dot product with another axis exactly by their cosine similarity. This matches the self-definitional pattern and makes the 'proportionate spillover' a necessary identity rather than a discovery about LLM semantics. No other load-bearing steps reduce to self-citation or fitted inputs.