Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

Mennatallah El-Assady; Raphaël Baur; Sinie van der Ben; Yannick Metz

arXiv: 2606.26987 · v1 · pith:JFKXQSZ3new · submitted 2026-06-25 · cs.CL · cs.AI

Pith reviewed 2026-06-26 04:59 UTC · model grok-4.3

classification cs.CL · cs.AI

keywords emotion vectors · valence · arousal · contrast vectors · language models · open-weight models · principal components

The pith

Emotion contrast vectors in two open-weight models recover valence geometry, with peak PC1–valence correlations (r = 0.76 and r = 0.83) close to the r = 0.81 reported for Claude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether emotion vectors previously identified in a closed model generalize to open-weight language models. It extracts contrast vectors from two models using their own generated stories and measures how well the vectors' principal components align with human-rated valence and arousal. Both models show strong valence alignment on the first component, though the layers where this alignment peaks differ markedly between the models. Arousal alignment proves weaker and varies with which model's stories are used to build the vectors. The work establishes that the basic structure is not limited to proprietary systems.

Core claim

We recover valence geometry for both models, with peak PC1–valence correlations of r = 0.76 and r = 0.83, approaching the r = 0.81 reported for Claude. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B-it, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B-Instruct-2509 exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus.

What carries the argument

Emotion contrast vectors extracted layer-wise from model-generated corpora and projected onto principal components.
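
The pipeline named above can be illustrated with a minimal sketch. It is not the authors' released code: the contrast-vector definition (per-emotion mean activation minus the mean over emotions), the helper names `contrast_vectors` and `pc_scores`, the four emotions, and their valence ratings are all assumptions made for illustration, and the "activations" are synthetic data with a planted valence direction standing in for a single layer.

```python
import numpy as np

def contrast_vectors(acts, labels):
    """Per-emotion mean activation minus the mean over emotions (assumed definition)."""
    labels = np.asarray(labels)
    emotions = sorted(set(labels.tolist()))
    means = np.stack([acts[labels == e].mean(axis=0) for e in emotions])
    return emotions, means - means.mean(axis=0, keepdims=True)

def pc_scores(vectors, k=2):
    """Scores of each emotion vector on the top-k principal components (via SVD)."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

# Synthetic stand-in for one layer's activations: a planted "valence" direction plus noise.
rng = np.random.default_rng(0)
valence = {"happy": 0.9, "calm": 0.5, "angry": -0.6, "gloomy": -0.8}  # made-up ratings
dim, n_per_emotion = 32, 50
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

acts, labels = [], []
for emotion, v in valence.items():
    acts.append(5.0 * v * direction + rng.normal(size=(n_per_emotion, dim)))
    labels += [emotion] * n_per_emotion
acts = np.concatenate(acts)

emotions, vecs = contrast_vectors(acts, labels)
scores = pc_scores(vecs)
ratings = np.array([valence[e] for e in emotions])
# The sign of a principal component is arbitrary, so only |r| is meaningful here.
r_pc1_valence = abs(np.corrcoef(scores[:, 0], ratings)[0, 1])
print(f"|r| between PC1 and valence: {r_pc1_valence:.2f}")
```

Repeating the last four lines for every layer would give a per-layer correlation curve of the kind Figure 1 describes; with real models, `acts` would be replaced by hidden states collected while the model reads the emotion stories.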

If this is right

Valence geometry appears in open-weight models at a strength comparable to that reported for Claude Sonnet 4.5.
The depth at which valence is most strongly encoded differs between the two models tested.
Arousal alignment is weaker overall and changes with the choice of generated corpus.
Emotion concepts can be isolated as linear directions in open-weight model activations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The layer-wise differences suggest that emotion geometry may be shaped by training details or architecture choices.
Corpus sensitivity for arousal implies that richer or more varied text sources could strengthen secondary dimensions.
Because the vectors are recoverable in open models, they become directly editable for studying causal effects on output behavior.

Load-bearing premise

The two model-generated corpora contain sufficiently rich and unbiased emotion-relevant linguistic cues for the contrast vectors to isolate internal emotion concepts rather than corpus artifacts.

What would settle it

Repeating the extraction on a new open model or with human-authored emotion stories and obtaining no significant PC1-valence correlation.

Figures

Figures reproduced from arXiv: 2606.26987 by Mennatallah El-Assady, Raphaël Baur, Sinie van der Ben, Yannick Metz.

**Figure 1.** Pearson correlation between the top two PCs of the emotion-vector space and human valence (left, PC1) and arousal (right, PC2) across fractional layer depth, for Apertus 8B and GEMMA-4-E4B probed on Apertus- and Gemma-generated stories (four conditions). Hue = model (blue = Apertus, red = Gemma); line style = story source (solid = Apertus, dashed = Gemma). Dotted gray lines mark the Sonnet 4.5 reference a…

**Figure 2.** APERTUS-8B CKA results on Apertus stories (axes: layer × layer).

**Figure 3.** APERTUS-8B CKA values on Gemma stories (axes: layer × layer).

**Figure 4.** GEMMA-4-E4B CKA values on Gemma stories (axes: layer × layer).

**Figure 5.** GEMMA-4-E4B CKA values on Apertus stories (axes: layer × layer).

**Figure 6.** APERTUS-8B validation on Apertus stories (axes: layer × layer).

**Figure 7.** APERTUS-8B validation on Gemma stories (axes: layer × layer).

**Figure 8.** APERTUS-8B validation on Apertus stories. Axes: layer (midpoint of pair) vs. cosine similarity. Panel title: Apertus 8B Valence direction: adjac…

**Figure 10.** GEMMA-4-E4B validation on Gemma stories (axes: layer × layer).

**Figure 11.** GEMMA-4-E4B validation on Apertus stories (axes: layer × layer).

**Figure 13.** GEMMA-4-E4B on Gemma stories. Axes: layer (midpoint of pair) vs. cosine similarity.

**Figure 14.** APERTUS-8B validation. Panel "Apertus 8B L23, Apertus stories": PC1 (13.5% var), r_val = 0.723; PC2 (12.4% var), r_aro = 0.146. Second panel: PC1 (15.6% var), r_val = 0.748; PC2 (8.7% var), r_aro = 0.422. Plotted emotions: afraid, angry, calm, depressed, ecstatic, excited, furious, gloomy, happy, joyful, serene.

**Figure 15.** GEMMA-4-E4B: valence-arousal PCA. Panel "Gemma 4 8B L13, Apertus stories": PC1 (13.8% var), r_val = 0.794; PC2 (12.2% var), r_aro = 0.046. Panel "Gemma 4 8B L13, Gemma st…": PC1 (15.0% var), r_val = 0.798; PC2 (8.9% var), r_aro = 0.193. Plotted emotions: afraid, angry, calm, depressed, ecstatic, excited, furious, gloomy, happy, joyful, serene.
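
Several of the figures report CKA values between layers. This page does not say which CKA variant the paper computes, so the following is only a sketch of the standard linear form, under the assumption that rows are samples (for example, emotions or stories) and columns are features; the function name `linear_cka` and the random test matrices are illustrative.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) matrices; feature counts may differ."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(100, 10))
# An orthogonal rotation plus isotropic rescaling of layer_a: CKA should stay at 1.
rotation, _ = np.linalg.qr(rng.normal(size=(10, 10)))
layer_b = 3.0 * layer_a @ rotation
# An unrelated representation with a different width.
layer_c = rng.normal(size=(100, 16))

cka_same = linear_cka(layer_a, layer_a)
cka_rotated = linear_cka(layer_a, layer_b)
cka_unrelated = linear_cka(layer_a, layer_c)
print(cka_same, cka_rotated, cka_unrelated)
```

Because linear CKA is invariant to rotation and uniform scaling, a layer-by-layer matrix of values near 1.00 indicates only that the representations share the same similarity structure over the samples, not that the activations are identical.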

Original abstract

Recent work identified emotion vectors in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1–valence correlations of r = 0.76 and r = 0.83, approaching the r = 0.81 reported for Claude. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B-it, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B-Instruct-2509 exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2–arousal alignment with Gemma-generated stories (r up to 0.45) than Apertus-generated ones (r ≤ 0.21), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Replicates valence geometry in two open models with new layer-emergence and corpus-sensitivity observations, but thin methods leave open whether the vectors track internal concepts or generation artifacts.

The colleague should know two things. First, the paper recovers peak PC1-valence correlations of 0.76 and 0.83 in Apertus and Gemma, close to the 0.81 reported for Claude. Second, it adds two observations not in the prior work: valence emerges early and then collapses in Gemma but appears at mid-depth in Apertus, and arousal alignment varies sharply with the model-generated corpus used for extraction.

What the paper does is straightforward replication plus those two extensions. Reporting the layer patterns and the corpus effect on arousal is useful because it shows the geometry is not uniform across architectures or data sources. Opening the code and dataset is the right move for a measurement study like this.

The soft spots are in the methods and the interpretation. The abstract gives raw correlations with no error bars, no statistical tests, no description of layer selection, and no mention of multiple-comparison correction. That makes it hard to judge whether the reported peaks are stable or post-hoc. More importantly, the explicit corpus dependence for arousal (r up to 0.45 on one corpus, ≤0.21 on the other) directly shows the contrast vectors pick up linguistic distribution differences. The paper does not report an equivalent robustness check for the valence results, so it is not yet clear whether those correlations reflect stable internal representations or the same kind of generation artifact.

This is for people already working on mechanistic interpretability of affective representations who want to see the Claude finding tested outside closed models. It deserves a serious referee because the replication target is clear and the new observations are falsifiable with the released code, even though the current write-up needs tighter statistical reporting and a direct test of whether valence is corpus-sensitive too.

Referee Report

2 major / 0 minor

Summary. The paper extends findings on emotion vectors from Claude to two open-weight models (Apertus-8B-Instruct-2509 and Gemma-4-E4B-it). Using contrast vectors extracted from two model-generated corpora, it reports recovery of valence geometry via PCA, with peak PC1-valence correlations of r=0.76 and r=0.83. It additionally documents depth-dependent differences in valence encoding across models and corpus sensitivity in arousal alignment (stronger on Gemma-generated data). Code and data are open-sourced.

Significance. Replication of valence geometry in open models would support the generality of internal emotion representations beyond proprietary systems. The reported layer-wise emergence patterns and the explicit corpus-sensitivity finding for arousal provide concrete observations that could guide future mechanistic work. Open-sourcing the experiment code and dataset is a clear strength that enables direct reproducibility and extension by others.

major comments (2)

[Results on arousal and valence encoding] Results section on arousal encoding: the manuscript explicitly reports corpus dependence for PC2-arousal alignment (r up to 0.45 on Gemma corpus vs r ≤ 0.21 on Apertus corpus) yet provides no parallel breakdown of PC1-valence correlations computed separately on each corpus. Because the central claim is that the contrast vectors isolate stable model-internal emotion concepts rather than corpus artifacts, the absence of this robustness check for the headline valence results (r=0.76/0.83) leaves the interpretation underdetermined.
[Results on valence geometry] Results on valence geometry: the peak PC1-valence correlations are stated without error bars, confidence intervals, p-values, or any description of layer-selection criteria and correction for multiple comparisons across layers. Given that the maxima are identified post-hoc by scanning all layers, these omissions prevent assessment of whether the reported values are statistically distinguishable from chance or from other layers.
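
The concern about post-hoc peak selection in the second comment can be made concrete with a max-statistic permutation test. The sketch below is one possible check, not a procedure from the paper: it shuffles the human ratings across emotions, recomputes the largest |r| over all layers for each shuffle, and compares the observed peak against that null distribution. The function names, the layer and emotion counts, and the random data are all hypothetical.

```python
import numpy as np

def max_abs_r(scores_by_layer, ratings):
    """Largest |Pearson r| between any layer's PC1 scores and the ratings."""
    z = (ratings - ratings.mean()) / ratings.std()
    s = scores_by_layer - scores_by_layer.mean(axis=1, keepdims=True)
    s = s / s.std(axis=1, keepdims=True)
    return float(np.abs(s @ z / len(z)).max())

def max_stat_pvalue(scores_by_layer, ratings, n_perm=2000, seed=0):
    """Permutation p-value for the peak correlation, accounting for the scan over layers."""
    rng = np.random.default_rng(seed)
    observed = max_abs_r(scores_by_layer, ratings)
    null = np.array([
        max_abs_r(scores_by_layer, rng.permutation(ratings)) for _ in range(n_perm)
    ])
    return observed, float((1 + np.sum(null >= observed)) / (1 + n_perm))

# Hypothetical data: 30 layers of pure-noise PC1 scores for 12 emotions,
# then the same layers plus one layer that genuinely tracks the ratings.
rng = np.random.default_rng(1)
n_layers, n_emotions = 30, 12
ratings = rng.normal(size=n_emotions)
noise_layers = rng.normal(size=(n_layers, n_emotions))
planted = np.vstack([noise_layers, ratings + 0.1 * rng.normal(size=n_emotions)])

null_peak, null_p = max_stat_pvalue(noise_layers, ratings)
planted_peak, planted_p = max_stat_pvalue(planted, ratings)
print(f"noise only: peak |r| = {null_peak:.2f}, p = {null_p:.3f}")
print(f"with signal: peak |r| = {planted_peak:.2f}, p = {planted_p:.4f}")
```

With few emotions and many layers, the noise-only peak can look sizeable on its own, which is why the comparison has to be made against the distribution of the maximum rather than against a single-layer null.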

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the two major comments below and will incorporate revisions to enhance the statistical reporting and robustness checks as suggested.

Referee: Results on arousal and valence encoding: the manuscript explicitly reports corpus dependence for PC2-arousal alignment (r up to 0.45 on Gemma corpus vs r ≤ 0.21 on Apertus corpus) yet provides no parallel breakdown of PC1-valence correlations computed separately on each corpus. Because the central claim is that the contrast vectors isolate stable model-internal emotion concepts rather than corpus artifacts, the absence of this robustness check for the headline valence results (r=0.76/0.83) leaves the interpretation underdetermined.

Authors: We agree that a corpus-specific breakdown for the valence correlations would provide stronger evidence for the stability of the emotion vectors. Although the original analysis pooled the corpora to identify the peak correlations, we will recompute and report the PC1-valence correlations separately for each corpus in the revised manuscript. This will include the maximum r values per corpus for each model, allowing direct comparison to the arousal results and addressing concerns about corpus artifacts.

revision: yes

Referee: Results on valence geometry: the peak PC1-valence correlations are stated without error bars, confidence intervals, p-values, or any description of layer-selection criteria and correction for multiple comparisons across layers. Given that the maxima are identified post-hoc by scanning all layers, these omissions prevent assessment of whether the reported values are statistically distinguishable from chance or from other layers.

Authors: The referee correctly identifies a gap in our statistical reporting. The correlations were computed layer-wise, with peaks selected as the maximum across layers. In the revision, we will add bootstrap-derived confidence intervals and standard errors for the reported correlations, p-values for the peak correlations (testing against zero), and apply a multiple-comparison correction (e.g., false discovery rate) across the layers tested. We will also explicitly describe the layer-selection procedure. These additions will allow readers to evaluate the reliability of the reported peaks.

revision: yes
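
The two statistical additions named in this response can be sketched as follows. This is an illustration under assumptions, not the authors' planned implementation: it uses a percentile bootstrap that resamples emotions with replacement, and the Benjamini-Hochberg step-up procedure as one common false-discovery-rate correction. The function names and the example data are hypothetical.

```python
import numpy as np

def bootstrap_r_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Pearson r, resampling paired points."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if np.std(x[idx]) == 0 or np.std(y[idx]) == 0:
            continue  # skip degenerate resamples where r is undefined
        rs.append(np.corrcoef(x[idx], y[idx])[0, 1])
    lo, hi = np.quantile(rs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate q (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True      # reject that rank and all smaller p-values
    return reject

# Hypothetical example: 20 paired points with a strong linear relationship.
rng = np.random.default_rng(0)
pc1 = np.linspace(-1.0, 1.0, 20)
valence = pc1 + 0.1 * rng.normal(size=20)
ci_low, ci_high = bootstrap_r_ci(pc1, valence)

# Hypothetical per-layer p-values; the third is rejected only through the step-up rule.
rejected = benjamini_hochberg([0.01, 0.03, 0.035, 0.9], q=0.05)
print(ci_low, ci_high, rejected)
```

With only about a dozen emotion words per analysis, bootstrap intervals will be wide and somewhat unstable, so they are better read as a rough indication of uncertainty than as precise bounds.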

Circularity Check

0 steps flagged

No significant circularity; empirical measurement against external ratings

The paper extracts contrast vectors from model-generated text, computes PCA, and reports direct Pearson correlations between PC1/PC2 and independent human valence/arousal ratings (r values given explicitly). No equations, fitted parameters, or self-citations reduce these correlations to inputs by construction. The work cites prior Claude results only for comparison and does not invoke uniqueness theorems, ansatzes, or renamings from overlapping authors. Corpus sensitivity for arousal is reported transparently rather than hidden, but this affects validity, not circularity of the derivation chain. The central results remain independent empirical measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that contrast vectors extracted from model-generated text isolate emotion concepts rather than generation artifacts, plus standard assumptions about PCA capturing the dominant variance in activation differences. No free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Model-generated stories contain emotion-relevant cues that are representative of the concepts the model internally represents.
Invoked when using the generated corpora to extract contrast vectors; the observed corpus sensitivity for arousal indicates this assumption may not hold uniformly.

domain assumption: Principal components of contrast vectors correspond to psychologically meaningful dimensions such as valence and arousal.
Used when reporting PC1–valence and PC2–arousal correlations.
