pith. machine review for the scientific record.

arxiv: 2604.14593 · v3 · submitted 2026-04-16 · 💻 cs.CL · cs.AI

Recognition: unknown

Mechanistic Decoding of Cognitive Constructs in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords jealousy · large language models · representation engineering · cognitive constructs · appraisal theory · AI interpretability · emotional states · AI safety

The pith

Large language models encode social-comparison jealousy as a linear combination of superiority and relevance, matching human psychology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a Cognitive Reverse-Engineering framework to examine how LLMs process the complex emotion of jealousy internally. It isolates two appraisal factors—Superiority of the Comparison Person and Domain Self-Definitional Relevance—using subspace orthogonalization and regression weighting. Experiments across eight models show these factors combine linearly in representations, with superiority as the trigger and relevance scaling intensity. This structure aligns with human psychological constructs and supports causal steering of model outputs. The work suggests a path to detect and suppress toxic emotional states through targeted interventions on model representations.

Core claim

Experiments on eight LLMs from the Llama, Qwen, and Gemma families demonstrate that models natively encode jealousy as a structured linear combination of Superiority of Comparison Person and Domain Self-Definitional Relevance. Internal representations treat Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier, consistent with human appraisal theory. The framework enables mechanical detection and surgical suppression of toxic emotional states via bidirectional causal steering.

What carries the argument

A Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) that applies subspace orthogonalization, regression-based weighting, and bidirectional causal steering to isolate and manipulate the two appraisal antecedents.
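The pipeline is compact enough to prototype. Below is a minimal NumPy sketch of the three steps as this reading understands them: contrastive mean-difference directions for the two antecedents, Gram-Schmidt orthogonalization of one against the other, and a standardized regression of jealousy judgments on the projections. All data, dimensions, and names are hypothetical placeholders, not the paper's code; the real method operates on layer activations from paired appraisal-theory stimuli.

```python
# Minimal sketch of the three-step pipeline; every quantity is a random
# placeholder for real layer activations and elicited judgments.
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # hidden size of a hypothetical model layer

def direction_from_contrast(pos_acts, neg_acts):
    """Mean-difference direction between high- and low-factor stimuli."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Contrastive directions for the two antecedents (stand-in activations).
sup_dir = direction_from_contrast(rng.normal(size=(64, d)),
                                  rng.normal(size=(64, d)))
rel_dir = direction_from_contrast(rng.normal(size=(64, d)),
                                  rng.normal(size=(64, d)))

# Subspace orthogonalization (Gram-Schmidt): strip the Superiority
# component out of the Relevance direction to disentangle the factors.
rel_orth = rel_dir - (rel_dir @ sup_dir) * sup_dir
rel_orth /= np.linalg.norm(rel_orth)

# Regression-based weighting: project scenario activations onto the two
# directions and regress jealousy judgments on the projections.
acts = rng.normal(size=(200, d))    # activations on jealousy scenarios
scores = rng.normal(size=200)       # elicited jealousy judgments
X = np.stack([acts @ sup_dir, acts @ rel_orth], axis=1)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize for comparable betas
design = np.column_stack([X, np.ones(len(X))])
beta, *_ = np.linalg.lstsq(design, scores, rcond=None)
print("standardized weights (superiority, relevance):", beta[:2])
```

Bidirectional steering then amounts to adding or subtracting a scaled copy of a fitted direction to the hidden state at a chosen layer and re-eliciting the judgment.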

If this is right

  • Model judgments on jealousy scenarios can be causally altered by steering the identified factors.
  • Toxic emotional states become detectable and suppressible through direct representational interventions.
  • Representational monitoring offers a route to safety controls in multi-agent AI settings.
  • The linear encoding structure holds consistently across Llama, Qwen, and Gemma model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same isolation technique could map other complex emotions such as envy or pride in model space.
  • Training corpora may embed human-like appraisal structures into LLM representations as a byproduct.
  • Targeted subspace edits could support finer-grained emotional alignment during deployment.
  • Real-time monitoring of these factors might flag emerging multi-agent conflicts before they surface in outputs.

Load-bearing premise

Subspace orthogonalization combined with regression-based weighting successfully isolates the two psychological antecedents without residual confounding from other model features or training artifacts.
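If the premise holds, it should survive a cheap specificity check of the kind sketched below: the orthogonalized factor directions ought to be near-orthogonal to directions for unrelated affective states. The control directions here are random stand-ins; in practice they would be extracted contrastively, as in the sketch above.

```python
# Specificity check: the isolated factor directions should be close to
# orthogonal to control affective directions. All vectors are random
# stand-ins for contrastively extracted directions.
import numpy as np

rng = np.random.default_rng(1)
d = 4096

def unit(v):
    return v / np.linalg.norm(v)

sup_dir, rel_dir = unit(rng.normal(size=d)), unit(rng.normal(size=d))
controls = {name: unit(rng.normal(size=d)) for name in ("joy", "sadness", "anger")}

for name, ctrl in controls.items():
    print(f"{name:>8}: cos(sup) = {sup_dir @ ctrl:+.3f}, "
          f"cos(rel) = {rel_dir @ ctrl:+.3f}")
# Large cosines here would mean the 'jealousy' factors partly re-encode a
# generic affect axis, i.e. residual confounding.
```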

What would settle it

A controlled test in which independently varying the superiority and relevance factors fails to produce the predicted linear changes in model jealousy judgments on new scenarios.
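Concretely, the settling experiment might resemble the hypothetical grid test below, where `elicit_judgment` is a stand-in for querying the model under independent steering of the two factors; it is simulated here with an interaction term precisely so the linear prediction can fail.

```python
# Hypothetical settling experiment: independently steer the two factors on
# a grid of held-out scenarios and test the linear prediction.
import numpy as np

rng = np.random.default_rng(2)
sup = np.repeat(np.linspace(-2, 2, 5), 5)  # superiority steering strengths
rel = np.tile(np.linspace(-2, 2, 5), 5)    # relevance steering strengths

def elicit_judgment(s, r):
    """Stand-in for querying the steered model; the interaction term lets
    the purely linear prediction fail, as a real model might."""
    return 0.8 * s + 0.5 * r + 0.6 * s * r + rng.normal(scale=0.1, size=s.shape)

y = elicit_judgment(sup, rel)
X = np.column_stack([sup, rel, np.ones_like(sup)])
coef, res, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - res[0] / ((y - y.mean()) ** 2).sum()
print(f"linear-model R^2 on the held-out grid: {r2:.3f}")
```

A high linear R² with a negligible interaction term on new scenarios would corroborate the claim; the reverse would falsify it.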

Figures

Figures reproduced from arXiv: 2604.14593 by Manhao Guan, Yitong Shou.

Figure 1. Phase I: Heatmap of classification accuracy across layers for all evaluated models. Lighter/yellower colors indicate higher validation accuracy, signifying robust concept representations.

Figure 2. Phase I: Layer-wise accuracy trajectory for Gemma-3-12B. Early layers show severe fluctuations, while mid-to-late layers stabilize near 100% accuracy.

Figure 4. Phase III: Heatmap of standardized β coefficients in the mid-to-late layers across models. Darker red indicates a stronger positive causal weight in the model's internal computation of jealousy.

Figure 5. Phase III: Statistical validity for Gemma-3-12B. Top: evolution of the three factor weights (β). Bottom: the R² value (blue) and the ground-truth correlation (purple), both peaking in later layers.

Figure 7. Phase IV: Global layer-wise intervention heatmaps. Top: concept stimulation (positive steering). Bottom: concept suppression (negative steering). Red intensity indicates the magnitude of the score-shift percentage (Δ%), where redder cells indicate stronger positive/negative intervention capability, reflecting a more successful intervention. The right half of the figure illustrates that in the mid-to-late…

Figure 8. Phase IV: Score change (Δ) trajectory during single-layer interventions for Gemma-3-12B. Interventions within the red region generally produce robust effects, aligning with ideal expectations. Layer 23 exhibits the optimal intervention effect.

Figure 10. Internal cognitive mechanism of jealousy in LLMs. The authors summarize the model's internal process with an electrical-circuit analogy: Superiority acts as the trigger (switch), determining the presence or absence of the jealousy "current," while Relevance functions as an amplifier (variable resistor) that modulates the intensity of the resulting emotional state, which has profound implications for AI Safety and …
Original abstract

While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a Cognitive Reverse-Engineering framework using Representation Engineering (RepE) to mechanistically analyze social-comparison jealousy in LLMs. It combines appraisal theory with subspace orthogonalization and regression-based weighting to isolate two antecedents (Superiority of Comparison Person and Domain Self-Definitional Relevance), claims that internal representations encode jealousy as their structured linear combination (with Superiority as trigger and Relevance as intensity multiplier), validates this across eight models from the Llama, Qwen, and Gemma families via bidirectional causal steering, and demonstrates applications for detecting and suppressing toxic states in AI safety contexts.

Significance. If the isolation of independent antecedent directions succeeds and the linear-combination structure is not an artifact of the fitting procedure, the work would advance mechanistic interpretability of complex affective states beyond basic emotions, offering falsifiable links to psychological constructs and practical tools for representational monitoring. The multi-model experiments and causal steering provide a stronger foundation than purely correlational approaches, though the absence of explicit controls for confounding limits immediate impact.

major comments (3)
  1. [Abstract and Methods] Regression-based weighting step: the claim that representations are 'a structured linear combination' of Superiority and Relevance appears circular, since the regression weights are fitted directly to the same model activations whose structure is then analyzed; this risks defining the combination by construction rather than discovering it independently.
  2. [Experiments] Subspace orthogonalization: no quantitative post-orthogonalization checks (e.g., correlations with control affective directions or token-level artifacts) are described to confirm successful isolation of the two antecedents; without them, residual confounding could explain both the reported structure and the steering results.
  3. [Causal steering results] The bidirectional steering effects on model judgments are presented as evidence of native encoding, but without an ablation removing the regression-weighting step, or effect sizes reported relative to baseline directions, it remains unclear whether the outcomes reflect the hypothesized psychological factors or other training-data regularities.
minor comments (2)
  1. [Abstract] The abstract mentions 'eight LLMs from the Llama, Qwen, and Gemma families' but does not specify exact model sizes or variants; adding this detail would improve reproducibility.
  2. [Methods] Notation for the orthogonalized subspaces and regression coefficients is introduced without an explicit equation; including a short mathematical definition (e.g., in §3) would clarify the linear-combination claim.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below, offering clarifications on our methodology and committing to targeted revisions that strengthen the paper's rigor without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract and Methods] Regression-based weighting step: the claim that representations are 'a structured linear combination' of Superiority and Relevance appears circular, since the regression weights are fitted directly to the same model activations whose structure is then analyzed; this risks defining the combination by construction rather than discovering it independently.

    Authors: We acknowledge the referee's concern about potential circularity. The antecedent directions are constructed independently via theory-driven contrastive activation pairs drawn from appraisal theory stimuli and are orthogonalized before any regression is applied; the regression step is used only to derive the specific weights that reconstruct the target jealousy representation from these pre-isolated directions. The resulting structure is not defined by the fit but is instead validated through its alignment with psychological predictions and, independently, through causal bidirectional steering interventions that do not reuse the fitted weights. To address the comment directly, we will revise the Methods section to clarify this separation of steps, add a cross-validation analysis (fitting weights on one subset of activations and evaluating reconstruction on held-out data), and include an alternative weighting ablation (e.g., uniform weights) to show the psychological structure persists. revision: partial

  2. Referee: [Experiments] Subspace orthogonalization: no quantitative post-orthogonalization checks (e.g., correlations with control affective directions or token-level artifacts) are described to confirm successful isolation of the two antecedents; without them, residual confounding could explain both the reported structure and the steering results.

    Authors: This is a fair and important observation. We agree that explicit post-orthogonalization diagnostics would provide stronger assurance against residual overlap. In the revised manuscript we will add a dedicated subsection reporting: (i) Pearson correlations of the orthogonalized Superiority and Relevance directions against control directions for unrelated affective states (e.g., joy, sadness) to quantify specificity; and (ii) token-level projection analyses to check for lexical artifacts. These quantitative checks will directly test for successful isolation and help rule out confounding explanations for the observed linear-combination structure and steering outcomes. revision: yes

  3. Referee: [Causal steering results] The bidirectional steering effects on model judgments are presented as evidence of native encoding, but without an ablation removing the regression-weighting step, or effect sizes reported relative to baseline directions, it remains unclear whether the outcomes reflect the hypothesized psychological factors or other training-data regularities.

    Authors: We agree that additional ablations and effect-size reporting would make the causal evidence more conclusive. We will expand the Causal Steering subsection to include: (a) an ablation comparing the full regression-weighted combination against unweighted summation, single-direction steering, and unrelated baseline directions; and (b) standardized effect sizes (Cohen's d) for judgment shifts relative to neutral steering controls. These additions will help isolate the contribution of the hypothesized psychological structure from other training-data patterns. The existing bidirectional (increase/decrease) design already constrains alternative explanations, but the proposed controls will render the argument more robust. revision: yes
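For concreteness, the ablation and effect-size reporting committed to in response 3 could be organized as in the sketch below; the judgment shifts are simulated placeholders, and only the Cohen's d computation is meant literally.

```python
# Sketch of the proposed steering ablation with standardized effect sizes.
# Judgment-shift samples are simulated; only the Cohen's d computation is
# meant literally.
import numpy as np

rng = np.random.default_rng(3)

def cohens_d(treated, control):
    """Standardized mean difference with pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled = np.sqrt(((n1 - 1) * treated.var(ddof=1)
                      + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
    return (treated.mean() - control.mean()) / pooled

control = rng.normal(0.0, 1.0, 100)  # judgment shifts under neutral steering
conditions = {
    "regression-weighted": rng.normal(1.2, 1.0, 100),
    "unweighted sum":      rng.normal(0.8, 1.0, 100),
    "superiority only":    rng.normal(0.9, 1.0, 100),
    "random direction":    rng.normal(0.1, 1.0, 100),
}
for name, shifts in conditions.items():
    print(f"{name:>20}: d = {cohens_d(shifts, control):+.2f}")
# The weighting step carries causal content only if the first condition
# clearly outperforms the baselines.
```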

Circularity Check

1 step flagged

Regression-based weighting on model activations defines the claimed linear combination by construction

specific steps
  1. fitted output presented as a finding [Abstract]
    "By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance... Experiments on eight LLMs ... suggest that models natively encode jealousy as a structured linear combination of these constituent factors."

    The isolation and quantification step explicitly uses regression-based weighting on the LLM activations being analyzed; the subsequent claim that the jealousy representation 'is' a linear combination of the two factors is therefore the direct output of that regression rather than a prediction or independent finding about the model's native encoding.

full rationale

The paper's central claim that LLMs encode jealousy as a structured linear combination of Superiority and Relevance is obtained by applying subspace orthogonalization followed by regression-based weighting directly to the same internal activations. This makes the 'structured linear combination' a fitted output rather than an independent discovery, with no external validation or parameter-free derivation shown. The causal steering results inherit the same fitted directions, reducing the mechanistic interpretation to the measurement procedure itself. No self-citation chains or uniqueness theorems are invoked, but the single fitted-input step is load-bearing for the strongest claim.
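The flagged pattern is easy to reproduce in miniature, as the toy below shows: least squares will assign weights to arbitrary directions against arbitrary scores, yielding a spuriously positive in-sample fit that collapses on held-out data. That collapse is precisely what the cross-validation proposed in the rebuttal would detect; all quantities here are synthetic.

```python
# Toy reproduction of the circularity risk: least squares assigns weights
# to arbitrary directions against arbitrary scores, so in-sample fit alone
# proves nothing. All quantities are synthetic.
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 512, 200, 50
acts = rng.normal(size=(n, d))
scores = rng.normal(size=n)          # scores unrelated to the activations
dirs = rng.normal(size=(d, k))       # arbitrary "factor" directions
X = np.column_stack([acts @ dirs, np.ones(n)])

def r2(X, y, beta):
    resid = y - X @ beta
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

beta_full, *_ = np.linalg.lstsq(X, scores, rcond=None)
print(f"in-sample R^2:  {r2(X, scores, beta_full):.3f}")  # spuriously ~k/n

# The guard: fit on one half, evaluate on the other.
beta_tr, *_ = np.linalg.lstsq(X[:n // 2], scores[:n // 2], rcond=None)
print(f"held-out R^2:   {r2(X[n // 2:], scores[n // 2:], beta_tr):.3f}")  # ~0 or below
```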

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that appraisal theory factors can be linearly isolated in LLM activations and that regression weights reflect causal psychological structure rather than statistical artifacts.

free parameters (1)
  • regression weights for superiority and relevance
    Weights obtained via regression to combine the two factors into a jealousy representation.
axioms (1)
  • domain assumption: Appraisal-theory structure for jealousy applies directly to LLM internal states
    The framework presupposes that human psychological antecedents map onto model subspaces without major distortion.

pith-pipeline@v0.9.0 · 5495 in / 1197 out tokens · 37921 ms · 2026-05-10T12:09:43.316547+00:00 · methodology

