arxiv: 2604.03147 · v3 · submitted 2026-04-03 · cs.CL · cs.AI · cs.CY

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao

Pith reviewed 2026-05-13 20:11 UTC · model grok-4.3

keywords: valence-arousal, emotion vectors, LLM steering, circular geometry, refusal, sycophancy, PCA

The pith

LLMs organize emotion vectors in a circular valence-arousal subspace that controls affective tone and behaviors like refusal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that emotion steering vectors inside large language models sit in a two-dimensional subspace defined by valence and arousal and arranged in a circle. Axes recovered through principal component analysis and ridge regression align with human ratings of affect across 44,728 words. Moving along these axes changes the emotional quality of generated text monotonically. The same movement also raises or lowers refusal and sycophantic tendencies from a single shared space. The authors attribute the control to lexical mediation: refusal and compliance tokens occupy separate regions of the valence-arousal plane, so steering directly alters their emission probabilities.

Core claim

Emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B.

What carries the argument

The two-dimensional valence-arousal subspace recovered by PCA and ridge regression on emotion steering vectors, which shows circular geometry and modulates token emission probabilities for affect and behavior.
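The recovery step described above (PCA on the emotion steering vectors, then ridge regression onto affect scores) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names, the number of components `k`, the regularization strength `lam`, and the planted `true_v` / `true_a` directions are all assumptions introduced here.

```python
import numpy as np

def pca_components(X, k):
    """Return the top-k principal directions (as rows) of the centered data."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k]

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression: w = (Z^T Z + lam * I)^-1 Z^T y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def recover_axis(X, ratings, k=8, lam=1.0):
    """Learn one axis as a linear combination of the top-k PCs (k, lam are assumed values)."""
    pcs = pca_components(X, k)                    # (k, d)
    Z = (X - X.mean(axis=0)) @ pcs.T              # PC scores, (n, k)
    w = ridge_fit(Z, ratings - ratings.mean(), lam)
    axis = w @ pcs                                # map weights back to model space, (d,)
    return axis / np.linalg.norm(axis)

# Synthetic stand-in for steering vectors: points on a circle in a planted 2D plane.
rng = np.random.default_rng(0)
n, d = 60, 64
angles = rng.uniform(0, 2 * np.pi, n)
valence, arousal = np.cos(angles), np.sin(angles)
true_v, true_a = np.linalg.qr(rng.normal(size=(d, 2)))[0].T   # orthonormal planted axes
X = (3.0 * np.outer(valence, true_v)
     + 3.0 * np.outer(arousal, true_a)
     + 0.05 * rng.normal(size=(n, d)))

v_axis = recover_axis(X, valence)
a_axis = recover_axis(X, arousal)
corr_v = np.corrcoef(X @ v_axis, valence)[0, 1]   # projection vs. target score
```

In this toy setup the recovered axes should land close to the planted ones because the signal dominates the noise; with real activations the quality of the fit depends on the choice of `k` and `lam`, which this page does not report.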

Load-bearing premise

The axes found by PCA and regression truly match human valence and arousal rather than some other property that happens to correlate with them.

What would settle it

A fresh collection of words with independent human affect ratings where the projections onto the recovered axes show no correlation, or steering trials that produce no change in refusal or sycophancy once other correlated factors are controlled.

Figures

Figures reproduced from arXiv: 2604.03147 by Andrew Lee, Jie Zhang, Jing Shao, Lewen Yan, Lihao Sun, Xiaoya Lu.

**Figure 1.** Emotion steering vectors projected onto the VA subspace at layer 31, colored by valence. Gray circle: algebraic least-squares fit. The circular arrangement is analogous to the circumplex model of affect in human psychology (Russell, 1980). […] We then obtain the model’s self-reported VA ratings for each emotion category and learn VA subspaces as linear combinations of principal components of the emotion v…
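The caption mentions an algebraic least-squares circle fit without naming the variant. One common choice is the Kåsa-style fit, which turns the circle equation into a linear least-squares problem; the sketch below assumes that variant, and the function name and demo points are illustrative only.

```python
import numpy as np

def fit_circle_algebraic(points):
    """Kasa-style fit: solve x^2 + y^2 = 2*cx*x + 2*cy*y + c in the least-squares sense."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    # From (x - cx)^2 + (y - cy)^2 = r^2 it follows that c = r^2 - cx^2 - cy^2.
    r = np.sqrt(c + cx**2 + cy**2)
    return float(cx), float(cy), float(r)

# Demo on noiseless points from a known circle (center (0.5, -0.2), radius 2).
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
pts = np.column_stack([0.5 + 2 * np.cos(theta), -0.2 + 2 * np.sin(theta)])
cx, cy, r = fit_circle_algebraic(pts)
```

The algebraic form is cheap and has a closed-form solution, but it can be biased toward smaller radii when points cover only a short arc, so a geometric fit may be preferable in that case.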

**Figure 2.** Recovery of self-reported VA scores via correlation with learned subspace projections across layers. Solid lines: ridge regression over multiple PCs; dashed lines: best single PC. […] Circular Geometry of Emotion Representations. Projecting emotion steering vectors onto the learned VA subspace reveals a circular arrangement analogous to Russell’s circumplex model found in human psychology (Russell, 1980). Qu…

**Figure 3.** Radial heatmaps showing the effect of VA steering on open-ended generation. Each panel displays the change relative to unsteered baseline as a function of steering direction (angle) and strength (radius, α ∈ [0.01, 0.45]). Cardinal directions: 0° = +V, 90° = +A, 180° = −V, 270° = −A. (a) Valence change (VAD-BERT). (b) Arousal change (VAD-BERT). (c) Sentiment change (VADER). The horizontal gradient in (a…
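The angle-and-radius parameterization in this caption can be read as choosing a unit direction inside the VA plane and adding it to the hidden state with strength α. The sketch below assumes additive steering scaled by the hidden-state norm; that scaling, and the function names, are assumptions made for illustration, since the page does not give the steering formula.

```python
import numpy as np

def va_direction(v_axis, a_axis, angle_deg):
    """Unit direction in the VA plane: 0 deg = +V, 90 deg = +A, 180 deg = -V, 270 deg = -A."""
    t = np.deg2rad(angle_deg)
    d = np.cos(t) * v_axis + np.sin(t) * a_axis
    return d / np.linalg.norm(d)

def steer(hidden, v_axis, a_axis, angle_deg, alpha):
    """Add alpha * ||h|| * direction to each hidden state (norm scaling is an assumption)."""
    d = va_direction(v_axis, a_axis, angle_deg)
    scale = np.linalg.norm(hidden, axis=-1, keepdims=True)
    return hidden + alpha * scale * d

# Toy example with orthonormal axes in a 4-dimensional space.
v_axis = np.array([1.0, 0.0, 0.0, 0.0])
a_axis = np.array([0.0, 1.0, 0.0, 0.0])
hidden = np.ones((3, 4))                       # three token positions, each with norm 2
steered = steer(hidden, v_axis, a_axis, angle_deg=0, alpha=0.1)
```

Sweeping `angle_deg` over 0 to 360 and `alpha` over the reported range [0.01, 0.45] would reproduce the grid of conditions that a radial heatmap of this kind summarizes.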

**Figure 4.** Valence-arousal steering controls refusal behavior. Refusal rate as a function of signed steering strength α across three safety benchmarks. Steering along both valence (blue) and arousal (orange) directions bidirectionally modulates refusal rates, with negative α (decreasing V/A) increasing refusals and positive α (increasing V/A) suppressing them. Random directions within the representation space (gray) …

**Figure 5.** Valence steering reduces sycophantic behavior. Sycophancy rate as a function of steering strength α across three benchmarks. Random steering directions (gray) remain near baseline with minimal variation. […]itive valence reduce sycophancy on NLP Survey (97% → 91% and 85%, respectively). On Political Typology, negative valence produces the larger effect (78% → 47% vs. 68% at +0.30). Not all benchmarks exhibit…

**Figure 6.** Mechanistic evidence for lexical mediation. (a) Token Unembedding: Projection of unembedding vectors onto VA space. Refusal-associated tokens (red) cluster in the −V region, while compliance-associated tokens (green) cluster in the +V region. Squares indicate group means; the arrow shows their difference direction (256° on the circumplex). (b) MLP Neuron Alignment: VA alignment of neurons promoting refusal…
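Panel (a) amounts to projecting token unembedding vectors onto the two VA axes and measuring the angle of the difference between group means. A minimal sketch follows; the function names and the toy coordinates are hypothetical, and the sign convention (which group mean is subtracted from which) is an assumption, since the caption does not state it.

```python
import numpy as np

def project_to_va(vectors, v_axis, a_axis):
    """Coordinates of each row vector in the (V, A) plane."""
    basis = np.stack([v_axis, a_axis], axis=1)     # (d, 2)
    return vectors @ basis                          # (n, 2)

def difference_angle_deg(group_a, group_b):
    """Angle in [0, 360), counter-clockwise from +V, of mean(group_a) - mean(group_b)."""
    diff = group_a.mean(axis=0) - group_b.mean(axis=0)
    return float(np.degrees(np.arctan2(diff[1], diff[0])) % 360.0)

# Toy unembedding rows in a 4-dimensional space with V = e0 and A = e1.
v_axis = np.array([1.0, 0.0, 0.0, 0.0])
a_axis = np.array([0.0, 1.0, 0.0, 0.0])
refusal_vecs = np.array([[-1.0, 0.0, 0.3, 0.0], [-1.0, 0.2, 0.0, 0.1], [-1.0, -0.2, 0.0, 0.0]])
comply_vecs = np.array([[1.0, 0.0, 0.0, 0.2], [1.0, 0.1, 0.5, 0.0], [1.0, -0.1, 0.0, 0.0]])

refusal_va = project_to_va(refusal_vecs, v_axis, a_axis)
comply_va = project_to_va(comply_vecs, v_axis, a_axis)
angle = difference_angle_deg(refusal_va, comply_va)   # refusal minus compliance
```

In this toy case the refusal group sits at negative valence and the compliance group at positive valence with no net arousal offset, so the difference points along −V.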

**Figure 7.** Logit Lens Analysis: Top-3 Predicted Tokens Across Layers 18–31. Rows show predictions for harmful (red) and safe (green) prompts at each layer. Columns represent baseline and four steering conditions (+A, -A, +V, -V at α = 0.45). Harmful prompts consistently predict refusal tokens (I, cannot), while safe prompts predict compliance tokens (yes, Yes). Valence steering (+V/-V) modulates sentiment-related t…
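A logit lens reads out an intermediate hidden state through the model's output head to see which tokens it already favors. The sketch below shows the idea on a toy vocabulary; `final_norm` is a hypothetical callable standing in for the model's final normalization layer, and the identity unembedding matrix is purely illustrative.

```python
import numpy as np

def logit_lens_topk(hidden, unembed, vocab, k=3, final_norm=None):
    """Decode one hidden state with the unembedding matrix and return the top-k tokens."""
    h = final_norm(hidden) if final_norm is not None else hidden
    logits = unembed @ h                           # (vocab_size,)
    top = np.argsort(logits)[::-1][:k]             # indices of the k largest logits
    return [vocab[i] for i in top]

vocab = ["I", "cannot", "yes", "Yes"]
unembed = np.eye(4)                                # toy head: one dimension per token
hidden = np.array([3.0, 2.0, 0.0, 1.0])

baseline_top = logit_lens_topk(hidden, unembed, vocab)
# Adding a multiple of the "yes" unembedding row mimics steering toward a compliance token.
steered_top = logit_lens_topk(hidden + 4.0 * unembed[2], unembed, vocab)
```

Applying the same readout at each layer from 18 to 31, with and without a steering vector added, yields a table of the kind the figure describes.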

read the original abstract

We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper recovers a valence-arousal subspace from emotion steering vectors that correlates with human ratings and appears to steer both affect and refusal/sycophancy, but the evidence does not yet rule out simple lexical correlations.

read the letter

The core claim is that PCA plus ridge regression on emotion vectors yields a circular two-dimensional VA subspace whose axes control affective tone in generation and also give bidirectional control over refusal and sycophancy from the same directions. They report replication on Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, and they tie the behavioral effects to a lexical-mediation story in which refusal and compliance tokens sit in different VA regions.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace with circular geometry. Using PCA on emotion steering vectors followed by ridge regression against human VA ratings on 44,728 words, the authors recover axes whose projections correlate with affect ratings. Steering along these axes produces monotonic control over affective properties of generated text and bidirectional control over downstream behaviors including refusal and sycophancy. Results replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. A lexical mediation account is proposed to explain the effects via differential VA positioning of refusal and compliance tokens.

Significance. If the central claims hold after addressing confounds, the work would offer a mechanistically grounded subspace for unifying affective and multi-behavioral steering in LLMs. The cross-model replication and demonstration of bidirectional control from a single recovered subspace are notable strengths. The lexical mediation hypothesis provides a testable link to prior steering results. These findings could inform both interpretability and practical alignment techniques, though their impact hinges on demonstrating specificity to human VA dimensions rather than generic lexical correlates.

major comments (3)

[§4.2] §4.2 (ridge regression and correlation analysis): No controls, partial correlations, or regression covariates are reported for word frequency, concreteness, or other lexical properties known to correlate with VA ratings. Without these, the recovered axes may capture any co-varying lexical feature rather than model-internal affective geometry, directly weakening the claim that the subspace is specifically valence-arousal.
[§5.1] §5.1 (behavioral steering results): The manuscript provides no statistical tests, effect sizes, confidence intervals, or exclusion criteria for the 44,728-word ratings or the downstream refusal/sycophancy experiments. This absence makes it impossible to evaluate whether the reported monotonic and bidirectional effects are robust or potentially driven by incidental correlations.
[§3.3] §3.3 (circular geometry): The circular arrangement is presented visually from 2D projections, but no quantitative test (e.g., comparison against shuffled or random baselines, or a measure of angular uniformity) is described to establish that the geometry is statistically meaningful rather than an artifact of dimensionality reduction.
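The control requested in the §4.2 comment can be illustrated with a partial correlation: regress the covariates out of both the axis projection and the affect rating, then correlate the residuals. The sketch below uses synthetic data in which a single covariate (labeled `freq` here as a stand-in for word frequency) drives both variables; all names and values are assumptions for illustration.

```python
import numpy as np

def residualize(y, covariates):
    """Remove the least-squares linear effect of the covariates (plus intercept) from y."""
    C = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(C, y, rcond=None)
    return y - C @ beta

def partial_corr(x, y, covariates):
    """Pearson correlation between x and y after regressing the covariates out of both."""
    rx, ry = residualize(x, covariates), residualize(y, covariates)
    return float(np.corrcoef(rx, ry)[0, 1])

# Synthetic case: projection and rating are related only through the shared covariate.
rng = np.random.default_rng(0)
freq = rng.normal(size=2000)
projection = freq + 0.5 * rng.normal(size=2000)
rating = freq + 0.5 * rng.normal(size=2000)

raw_r = float(np.corrcoef(projection, rating)[0, 1])
partial_r = partial_corr(projection, rating, freq.reshape(-1, 1))
```

Here the raw correlation is large while the partial correlation is near zero, which is the pattern the referee worries about; a VA-specific axis would be expected to keep a substantial partial correlation after such controls.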

minor comments (2)

[§2.1] The definition of emotion steering vectors in §2.1 would be clearer with an explicit equation showing their extraction from residual streams or attention layers.
[Figure 2] Figure 2 and Figure 4 would benefit from overlaid density contours or bootstrap confidence regions to aid interpretation of the VA projections and steering trajectories.
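Confidence intervals of the kind requested for the refusal and sycophancy rates could be obtained with a percentile bootstrap over per-prompt outcomes. The sketch below is one simple option under that assumption; the function name, the number of resamples, and the toy outcome vector are illustrative and not taken from the paper.

```python
import numpy as np

def bootstrap_rate_ci(outcomes, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for a binary rate (for example, refused = 1, complied = 0)."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    # Resample prompts with replacement and recompute the rate each time.
    idx = rng.integers(0, len(outcomes), size=(n_boot, len(outcomes)))
    rates = outcomes[idx].mean(axis=1)
    lo, hi = np.quantile(rates, [(1 - level) / 2, 1 - (1 - level) / 2])
    return float(outcomes.mean()), float(lo), float(hi)

# Toy data: 30 refusals out of 100 prompts at one steering strength.
outcomes = np.array([1] * 30 + [0] * 70)
rate, lo, hi = bootstrap_rate_ci(outcomes)
```

Computing such an interval at each steering strength α, for the VA directions and for the random-direction baseline, would show whether the reported curves separate beyond resampling noise.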

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas to strengthen the manuscript. We address each major point below and will incorporate revisions accordingly.

read point-by-point responses

Referee: [§4.2] §4.2 (ridge regression and correlation analysis): No controls, partial correlations, or regression covariates are reported for word frequency, concreteness, or other lexical properties known to correlate with VA ratings. Without these, the recovered axes may capture any co-varying lexical feature rather than model-internal affective geometry, directly weakening the claim that the subspace is specifically valence-arousal.

Authors: We agree that additional controls are necessary to establish specificity to valence-arousal dimensions. In the revised manuscript, we will perform the ridge regression with covariates for word frequency, concreteness, and other lexical properties. We will report partial correlations to show that the VA projections retain significant correlation with affect ratings after controlling for these factors. This revision will directly address the concern that the axes may reflect generic lexical correlates.
revision: yes
Referee: [§5.1] §5.1 (behavioral steering results): The manuscript provides no statistical tests, effect sizes, confidence intervals, or exclusion criteria for the 44,728-word ratings or the downstream refusal/sycophancy experiments. This absence makes it impossible to evaluate whether the reported monotonic and bidirectional effects are robust or potentially driven by incidental correlations.

Authors: We acknowledge the lack of statistical reporting. We will add statistical tests for the monotonic trends (e.g., linear regression slopes with p-values), effect sizes, and confidence intervals for the behavioral steering results. Exclusion criteria for the word list and experiment details will be explicitly stated. These additions will provide the necessary rigor to assess the robustness of the findings.
revision: yes
Referee: [§3.3] §3.3 (circular geometry): The circular arrangement is presented visually from 2D projections, but no quantitative test (e.g., comparison against shuffled or random baselines, or a measure of angular uniformity) is described to establish that the geometry is statistically meaningful rather than an artifact of dimensionality reduction.

Authors: The circular geometry is observed in the 2D PCA projections of the emotion vectors. To provide quantitative support, we will include in the revision a statistical test for circular uniformity, such as the Rayleigh test, and comparisons to shuffled baselines where the angular distribution is randomized. This will confirm that the observed circular arrangement is not an artifact.
revision: yes
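The Rayleigh test mentioned in the last response can be sketched in a few lines. It measures the mean resultant length of the angles and tests the null hypothesis of uniformity, so on its own it speaks to angular spread rather than to ring shape; the second helper below, a coefficient of variation of the radii, is an additional illustrative measure introduced here and not something the page attributes to the authors. The p-value uses the first-order approximation exp(-n·R̄²).

```python
import numpy as np

def rayleigh_test(angles_rad):
    """Rayleigh test of circular uniformity: returns (mean resultant length, approx p-value)."""
    angles_rad = np.asarray(angles_rad, dtype=float)
    n = len(angles_rad)
    r_bar = np.abs(np.mean(np.exp(1j * angles_rad)))
    z = n * r_bar**2
    p = np.exp(-z)                 # first-order approximation to the p-value
    return float(r_bar), float(min(1.0, p))

def radial_cv(points):
    """Coefficient of variation of distances from the centroid; small values suggest a ring."""
    radii = np.linalg.norm(points - points.mean(axis=0), axis=1)
    return float(radii.std() / radii.mean())

# Evenly spread angles on a unit circle versus angles bunched at one location.
ring_angles = np.linspace(0, 2 * np.pi, 24, endpoint=False)
ring = np.column_stack([np.cos(ring_angles), np.sin(ring_angles)])
r_ring, p_ring = rayleigh_test(ring_angles)
r_clustered, p_clustered = rayleigh_test(np.full(24, 0.1))
ring_cv = radial_cv(ring)
```

A shuffled baseline could reuse both functions: permute the coordinates of each vector before projection, recompute the statistics, and compare the observed values against that null distribution.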

Circularity Check

1 step flagged

Ridge regression to human VA ratings makes claimed subspace geometry and steering effects dependent on fitted alignment

specific steps

fitted input called prediction [Abstract]
"Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words."

The axes are explicitly constructed via ridge regression to maximize correlation with the human VA ratings; therefore the reported correlation between projections and ratings, and the subsequent claim that steering along these axes controls affective properties, reduces to the fitting procedure by construction rather than revealing an independent model-internal geometry.

full rationale

The paper's core derivation begins with unsupervised PCA on emotion steering vectors, followed by ridge regression to align the resulting axes with external human valence-arousal ratings on 44,728 words. This alignment step renders the reported correlations and the attribution of monotonic affective control and bidirectional behavioral effects (refusal/sycophancy) to a psychologically meaningful VA subspace dependent on the fitted parameters rather than independently emergent from the model. The subsequent lexical mediation proposal and circular geometry claims inherit this dependence. No self-citation chains or definitional loops are present, and the unsupervised PCA component plus replication across models provide partial independent content, capping the circularity at moderate.
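The concern in this audit is the general one that a fitted projection will correlate with its own regression targets by construction. The sketch below illustrates that distinction on pure noise: ridge regression is fit on half of the data and the correlation is then measured both in-sample and on held-out rows. Everything here (sizes, `lam`, the random targets) is an assumption for illustration, and the page does not state which words, if any, the paper held out.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def in_vs_held_out_corr(Z, y, lam=1.0, train_frac=0.5, seed=0):
    """Fit on a random training split; return (in-sample r, held-out r)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = perm[:cut], perm[cut:]
    w = ridge_fit(Z[tr], y[tr], lam)
    r_in = np.corrcoef(Z[tr] @ w, y[tr])[0, 1]
    r_out = np.corrcoef(Z[te] @ w, y[te])[0, 1]
    return float(r_in), float(r_out)

# Features and targets are independent noise, so any correlation is a fitting artifact.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 50))
y_random = rng.normal(size=200)
r_in, r_out = in_vs_held_out_corr(Z, y_random)
```

With 50 features and 100 training rows the in-sample correlation is sizable even though no real relationship exists, while the held-out correlation stays near zero; reporting the held-out figure is what separates a recovered axis from a fitted one.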

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Assessment based solely on abstract; full methods and equations unavailable.

free parameters (1)

ridge regression regularization strength
Used to recover VA axes from steering vectors; value not reported in abstract.

axioms (1)

domain assumption: Principal components of emotion steering vectors align with human valence-arousal dimensions.
Invoked when interpreting the recovered axes as psychologically meaningful.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation: unclear
Paper passage: "We decompose emotion steering vectors... learn VA axes as linear combinations of their top PCA components via ridge regression... circular geometry consistent with... Russell (1980)"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation: unclear
Paper passage: "Steering along these axes produces monotonic control over... refusal and sycophancy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
cs.LG · 2026-05 · unverdicted · novelty 7.0
Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL · 2026-05 · unverdicted · novelty 6.0
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
