Mechanistic Decoding of Cognitive Constructs in Large Language Models
Pith reviewed 2026-05-10 12:09 UTC · model grok-4.3
The pith
Large language models encode social-comparison jealousy as a linear combination of superiority and relevance, matching human psychology.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of Superiority of Comparison Person and Domain Self-Definitional Relevance. Internal representations treat Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier, consistent with human appraisal theory. The framework also enables mechanical detection and surgical suppression of toxic emotional states via bidirectional causal steering.
What carries the argument
A Cognitive Reverse-Engineering framework, grounded in Representation Engineering (RepE), that applies subspace orthogonalization, regression-based weighting, and bidirectional causal steering to isolate and manipulate the two appraisal antecedents.
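The three-step pipeline named above can be sketched end to end on synthetic activations. This is a minimal reconstruction, not the paper's exact procedure: the mean-difference construction of the antecedent directions, the dimensions, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (illustrative)

# Contrastive activation pairs for each appraisal antecedent:
# mean(high-condition) - mean(low-condition) yields a candidate direction.
acts_sup_hi, acts_sup_lo = rng.normal(size=(2, 100, d))
acts_rel_hi, acts_rel_lo = rng.normal(size=(2, 100, d))
v_sup = acts_sup_hi.mean(0) - acts_sup_lo.mean(0)
v_rel = acts_rel_hi.mean(0) - acts_rel_lo.mean(0)

# Subspace orthogonalization (Gram-Schmidt): remove the Superiority
# component from the Relevance direction so the two factors are disentangled.
v_sup /= np.linalg.norm(v_sup)
v_rel = v_rel - (v_rel @ v_sup) * v_sup
v_rel /= np.linalg.norm(v_rel)

# Regression-based weighting: least-squares fit of a jealousy representation
# onto the two orthogonal antecedent directions (here the "representation"
# is simulated as a noisy linear mix, so the true weights are known).
jealousy_rep = 0.8 * v_sup + 0.3 * v_rel + 0.05 * rng.normal(size=d)
B = np.stack([v_sup, v_rel], axis=1)          # d x 2 design matrix
weights, *_ = np.linalg.lstsq(B, jealousy_rep, rcond=None)
print(weights)  # fitted contributions of Superiority and Relevance
```

Because the columns of `B` are orthonormal, the least-squares solution is just the projection of the jealousy representation onto each direction, which is what makes the two weights separately interpretable.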
If this is right
- Model judgments on jealousy scenarios can be causally altered by steering the identified factors.
- Toxic emotional states become detectable and suppressible through direct representational interventions.
- Representational monitoring offers a route to safety controls in multi-agent AI settings.
- The linear encoding structure holds consistently across Llama, Qwen, and Gemma model families.
Where Pith is reading between the lines
- The same isolation technique could map other complex emotions such as envy or pride in model space.
- Training corpora may embed human-like appraisal structures into LLM representations as a byproduct.
- Targeted subspace edits could support finer-grained emotional alignment during deployment.
- Real-time monitoring of these factors might flag emerging multi-agent conflicts before they surface in outputs.
Load-bearing premise
Subspace orthogonalization combined with regression-based weighting successfully isolates the two psychological antecedents without residual confounding from other model features or training artifacts.
What would settle it
A controlled test in which independently varying the superiority and relevance factors fails to produce the predicted linear changes in model jealousy judgments on new scenarios.
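That settling experiment can be made concrete as a toy linearity check: steer a hidden state along one antecedent direction and test whether a linear readout of the jealousy judgment shifts by the predicted amount. The readout `w` and the steering coefficients below are hypothetical stand-ins for the paper's fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
v_sup, v_rel = np.linalg.qr(rng.normal(size=(d, 2)))[0].T  # orthonormal dirs
w = 0.8 * v_sup + 0.3 * v_rel  # assumed linear readout of the jealousy score

def jealousy_score(h):
    return float(w @ h)

h0 = rng.normal(size=d)
alphas = np.linspace(-2, 2, 9)
shifts = [jealousy_score(h0 + a * v_sup) - jealousy_score(h0) for a in alphas]

# Under the linear-encoding hypothesis the shift is exactly alpha * (w @ v_sup);
# a systematic deviation on held-out scenarios would count against the claim.
predicted = alphas * (w @ v_sup)
print(np.max(np.abs(shifts - predicted)))
```

In a real model the scores would come from judgment prompts rather than a fixed readout, and the test is whether the observed shifts track the predicted line rather than saturating or bending.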
Original abstract
While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Cognitive Reverse-Engineering framework using Representation Engineering (RepE) to mechanistically analyze social-comparison jealousy in LLMs. It combines appraisal theory with subspace orthogonalization and regression-based weighting to isolate two antecedents (Superiority of Comparison Person and Domain Self-Definitional Relevance), claims that internal representations encode jealousy as their structured linear combination (with Superiority as trigger and Relevance as intensity multiplier), validates this across eight models from Llama/Qwen/Gemma families via causal bidirectional steering, and demonstrates applications for detecting/suppressing toxic states in AI safety contexts.
Significance. If the isolation of independent antecedent directions succeeds and the linear-combination structure is not an artifact of the fitting procedure, the work would advance mechanistic interpretability of complex affective states beyond basic emotions, offering falsifiable links to psychological constructs and practical tools for representational monitoring. The multi-model experiments and causal steering provide a stronger foundation than purely correlational approaches, though the absence of explicit controls for confounding limits immediate impact.
major comments (3)
- [Abstract and Methods] Abstract and Methods (regression-based weighting step): the claim that representations are 'a structured linear combination' of Superiority and Relevance appears circular, as the regression weights are fitted directly to the same model activations whose structure is then analyzed; this risks defining the combination by construction rather than independently discovering it.
- [Experiments] Experiments section (subspace orthogonalization): no quantitative post-orthogonalization checks (e.g., correlations with control affective directions or token-level artifacts) are described to confirm successful isolation of the two antecedents; without these, residual confounding could explain both the reported structure and the steering results.
- [Causal steering results] Causal steering results: the bidirectional steering effects on model judgments are presented as evidence of native encoding, but without an ablation removing the regression weighting step or reporting effect sizes relative to baseline directions, it remains unclear whether the outcomes reflect the hypothesized psychological factors or other training-data regularities.
minor comments (2)
- [Abstract] The abstract mentions 'eight LLMs from the Llama, Qwen, and Gemma families' but does not specify exact model sizes or variants; adding this detail would improve reproducibility.
- [Methods] Notation for the orthogonalized subspaces and regression coefficients is introduced without an explicit equation; including a short mathematical definition (e.g., in §3) would clarify the linear-combination claim.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. We address each major comment point by point below, offering clarifications on our methodology and committing to targeted revisions that strengthen the paper's rigor without altering its core claims.
Point-by-point responses
-
Referee: [Abstract and Methods] Abstract and Methods (regression-based weighting step): the claim that representations are 'a structured linear combination' of Superiority and Relevance appears circular, as the regression weights are fitted directly to the same model activations whose structure is then analyzed; this risks defining the combination by construction rather than independently discovering it.
Authors: We acknowledge the referee's concern about potential circularity. The antecedent directions are constructed independently via theory-driven contrastive activation pairs drawn from appraisal theory stimuli and are orthogonalized before any regression is applied; the regression step is used only to derive the specific weights that reconstruct the target jealousy representation from these pre-isolated directions. The resulting structure is not defined by the fit but is instead validated through its alignment with psychological predictions and, independently, through causal bidirectional steering interventions that do not reuse the fitted weights. To address the comment directly, we will revise the Methods section to clarify this separation of steps, add a cross-validation analysis (fitting weights on one subset of activations and evaluating reconstruction on held-out data), and include an alternative weighting ablation (e.g., uniform weights) to show the psychological structure persists. revision: partial
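The cross-validation the authors commit to could look roughly like the following: fit the antecedent weights on one half of the jealousy activations and score reconstruction on the held-out half. The half-split and mean-reconstruction R² criterion are assumptions about how such a check might be implemented.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 64, 200
v_sup, v_rel = np.linalg.qr(rng.normal(size=(d, 2)))[0].T  # orthonormal dirs
# Simulated jealousy activations with a known linear structure plus noise.
acts = (0.8 * v_sup + 0.3 * v_rel) + 0.1 * rng.normal(size=(n, d))

B = np.stack([v_sup, v_rel], axis=1)
train, test = acts[: n // 2], acts[n // 2:]

# Fit weights on the training mean only.
w_fit, *_ = np.linalg.lstsq(B, train.mean(0), rcond=None)

# Held-out reconstruction: R^2 of the weighted combination vs. the test mean.
recon = B @ w_fit
resid = test.mean(0) - recon
r2 = 1 - (resid @ resid) / (test.mean(0) @ test.mean(0))
print(round(r2, 3))
```

A high held-out R² is what would separate a discovered structure from one fitted by construction, which is the substance of the referee's circularity objection.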
-
Referee: [Experiments] Experiments section (subspace orthogonalization): no quantitative post-orthogonalization checks (e.g., correlations with control affective directions or token-level artifacts) are described to confirm successful isolation of the two antecedents; without these, residual confounding could explain both the reported structure and the steering results.
Authors: This is a fair and important observation. We agree that explicit post-orthogonalization diagnostics would provide stronger assurance against residual overlap. In the revised manuscript we will add a dedicated subsection reporting: (i) Pearson correlations of the orthogonalized Superiority and Relevance directions against control directions for unrelated affective states (e.g., joy, sadness) to quantify specificity; and (ii) token-level projection analyses to check for lexical artifacts. These quantitative checks will directly test for successful isolation and help rule out confounding explanations for the observed linear-combination structure and steering outcomes. revision: yes
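The promised specificity diagnostic, item (i), can be sketched with placeholder directions. The control emotions (joy, sadness) follow the rebuttal; the use of `np.corrcoef` for Pearson correlation and the random stand-in directions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
dirs = np.linalg.qr(rng.normal(size=(d, 4)))[0].T
v_sup, v_rel, v_joy, v_sad = dirs  # orthogonalized target + control directions

def r(a, b):
    """Pearson correlation between two direction vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Specificity table: target directions vs. unrelated affective controls.
# Small magnitudes indicate the antecedents were successfully isolated.
checks = {
    "sup~rel": r(v_sup, v_rel),
    "sup~joy": r(v_sup, v_joy),
    "rel~sad": r(v_rel, v_sad),
}
print({k: round(v, 3) for k, v in checks.items()})
```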
-
Referee: [Causal steering results] Causal steering results: the bidirectional steering effects on model judgments are presented as evidence of native encoding, but without an ablation removing the regression weighting step or reporting effect sizes relative to baseline directions, it remains unclear whether the outcomes reflect the hypothesized psychological factors or other training-data regularities.
Authors: We agree that additional ablations and effect-size reporting would make the causal evidence more conclusive. We will expand the Causal Steering subsection to include: (a) an ablation comparing the full regression-weighted combination against unweighted summation, single-direction steering, and unrelated baseline directions; and (b) standardized effect sizes (Cohen's d) for judgment shifts relative to neutral steering controls. These additions will help isolate the contribution of the hypothesized psychological structure from other training-data patterns. The existing bidirectional (increase/decrease) design already constrains alternative explanations, but the proposed controls will render the argument more robust. revision: yes
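The effect-size reporting promised in (b) is standard; here is what Cohen's d against a neutral-steering control looks like on simulated judgment scores. The score distributions are invented for illustration, but the pooled-variance formula is the conventional one.

```python
import numpy as np

rng = np.random.default_rng(4)
steered = rng.normal(1.2, 1.0, 200)  # judgments after +jealousy steering
control = rng.normal(0.0, 1.0, 200)  # judgments under neutral steering

def cohens_d(x, y):
    """Standardized mean difference with pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled

d_eff = cohens_d(steered, control)
print(round(d_eff, 2))  # standardized steering effect vs. neutral control
```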
Circularity Check
Regression-based weighting on model activations defines the claimed linear combination by construction
specific steps
-
fitted input presented as a prediction
[Abstract]
"By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance... Experiments on eight LLMs ... suggest that models natively encode jealousy as a structured linear combination of these constituent factors."
The isolation and quantification step explicitly uses regression-based weighting on the LLM activations being analyzed; the subsequent claim that the jealousy representation 'is' a linear combination of the two factors is therefore the direct output of that regression rather than a prediction or independent finding about the model's native encoding.
full rationale
The paper's central claim that LLMs encode jealousy as a structured linear combination of Superiority and Relevance is obtained by applying subspace orthogonalization followed by regression-based weighting directly to the same internal activations. This makes the 'structured linear combination' a fitted output rather than an independent discovery, with no external validation or parameter-free derivation shown. The causal steering results inherit the same fitted directions, reducing the mechanistic interpretation to the measurement procedure itself. No self-citation chains or uniqueness theorems are invoked, but the single fitted-input step is load-bearing for the strongest claim.
Axiom & Free-Parameter Ledger
free parameters (1)
- regression weights for superiority and relevance
axioms (1)
- domain assumption: the appraisal-theory structure of jealousy applies directly to LLM internal states