Recognition: no theorem link
On Emotion-Sensitive Decision Making of Small Language Model Agents
Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3
The pith
Induced emotional states in small language models shift strategic choices in games like Diplomacy and StarCraft II, yet the shifts remain unstable and misaligned with human patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Emotional perturbations systematically affect strategic choices, but the resulting behaviors are often unstable and not fully aligned with human expectations.
What carries the argument
Activation steering derived from crowd-validated emotion-eliciting texts, which produces controlled emotional states that are then tested inside canonical cooperative and competitive decision templates drawn from Diplomacy, StarCraft II, and real-world personas.
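The paper does not reproduce its extraction code here, but the general mean-difference steering recipe it builds on can be sketched in a few lines. Everything below is illustrative: the shapes, the random stand-in activations, and the `apply_steering` helper are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for layer-l hidden states collected while the model reads
# crowd-validated emotion-eliciting texts vs. matched neutral texts.
# Shapes are hypothetical: (n_texts, hidden_dim).
emotion_acts = rng.normal(0.5, 1.0, size=(64, 512))
neutral_acts = rng.normal(0.0, 1.0, size=(64, 512))

# Mean-difference steering vector: the direction separating the emotional
# condition from the neutral one in representation space.
steer = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def apply_steering(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden + alpha * vec

h = rng.normal(size=512)              # one token's hidden state
h_steered = apply_steering(h, steer, alpha=2.0)
print(h_steered.shape)                # (512,)
```

The key property of this representation-level intervention, versus prompt-based induction, is that the perturbation never appears in the token stream, so it cannot leak as text into the agent's context.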
If this is right
- Strategic choices in both complete-information and incomplete-information settings become sensitive to the induced emotional state.
- The magnitude and direction of the effect vary across model architectures and modalities.
- Current behaviors diverge from human-like responses, implying that emotion-robust training or filtering will be required before deployment in interactive settings.
Where Pith is reading between the lines
- If the instability persists under stronger steering methods, developers may need to treat emotional robustness as a separate training objective rather than an afterthought.
- The benchmark templates could be reused to test whether larger models or different induction techniques produce more human-aligned emotional responses.
- Real-world deployment of SLM agents in negotiation or competitive domains would carry unpredictable risk if emotional leakage from user messages is not explicitly mitigated.
Load-bearing premise
That activation steering from validated emotion texts produces clean, transferable emotional states inside the models without adding prompt-like leakage or uncontrolled side effects.
What would settle it
A controlled experiment in which the same emotional steering vector is applied repeatedly to identical game scenarios across many trials; if the agents' move distributions remain statistically indistinguishable from the neutral baseline or fluctuate randomly instead of showing a stable directional shift, the claim of systematic emotional influence would be falsified.
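That falsification test reduces to comparing move-count distributions against the neutral baseline. A minimal self-contained sketch, with hypothetical move labels and probabilities standing in for real agent behavior, using a Pearson chi-squared statistic against the df = 2 critical value at alpha = 0.05:

```python
import random

def chi2_stat(observed, expected):
    """Pearson chi-squared statistic between observed counts and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

random.seed(0)
neutral_p = [0.5, 0.3, 0.2]   # hypothetical baseline move distribution
steered_p = [0.3, 0.5, 0.2]   # hypothesised stable directional shift

n = 500                        # trials per condition
neutral_counts = [0, 0, 0]
steered_counts = [0, 0, 0]
for _ in range(n):
    neutral_counts[random.choices(range(3), neutral_p)[0]] += 1
    steered_counts[random.choices(range(3), steered_p)[0]] += 1

# Expected counts under H0: the steered agent follows the neutral distribution.
expected = [p * n for p in neutral_p]
stat = chi2_stat(steered_counts, expected)

# Critical value for df = 2 at alpha = 0.05 is 5.991.
systematic_shift = stat > 5.991
print(systematic_shift)
```

If `systematic_shift` were repeatedly false across trials, or flipped sign unpredictably across repetitions, the claim of systematic emotional influence would fail exactly as described above.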
read the original abstract
Small language models (SLM) are increasingly used as interactive decision-making agents, yet most decision-oriented evaluations ignore emotion as a causal factor influencing behavior. We study emotion-sensitive decision making by combining representation-level emotion induction with a structured game-theoretic evaluation. Emotional states are induced using activation steering derived from crowd-validated, real-world emotion-eliciting texts, enabling controlled and transferable interventions beyond prompt-based methods. We introduce a benchmark built around canonical decision templates that span cooperative and competitive incentives under both complete and incomplete information. These templates are instantiated using strategic scenarios from Diplomacy, StarCraft II, and diverse real-world personas. Experiments across multiple model families in various architecture and modalities, show that emotional perturbations systematically affect strategic choices, but the resulting behaviors are often unstable and not fully aligned with human expectations. Finally, we outline an approach to improve robustness to emotion-driven perturbations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies emotion-sensitive decision making in small language models by inducing emotional states via activation steering from crowd-validated emotion-eliciting texts and evaluating impacts on strategic choices using game-theoretic templates from Diplomacy, StarCraft II, and real-world personas. It reports that emotional perturbations systematically affect choices across model families but produce unstable behaviors often misaligned with human expectations, and outlines approaches to improve robustness.
Significance. If the attribution of behavioral changes to specific emotional states holds after proper controls, the results would demonstrate that representation-level interventions can reveal causal emotional influences on SLM agent decisions, with implications for safer deployment in interactive settings. The use of established game benchmarks is a strength for comparability, though the reported instability limits immediate practical significance.
major comments (2)
- [§4] §4 (Experimental Setup and Results): The central claim that emotional perturbations cause the observed shifts in strategic choices (e.g., in Diplomacy and StarCraft templates) requires isolating emotion-specific effects. The methods derive steering vectors from emotion texts but lack reported controls such as neutral steering vectors, magnitude-matched random directions, or non-emotional text-derived vectors. Without these, systematic changes could stem from generic activation perturbations rather than induced emotional states, directly undermining attribution of instability and misalignment to emotion (as flagged in the stress-test concern).
- [§4.2] §4.2 (Model Families and Statistical Analysis): The abstract claims systematic effects 'across multiple model families' yet immediately qualifies them as unstable; however, the full methods do not detail statistical tests, data exclusion rules, or explicit baseline comparisons. This makes it impossible to judge whether the steering vectors produce cleanly isolated effects or if results are driven by model sensitivity to any perturbation.
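The magnitude-matched random-direction control requested above is simple to construct; a minimal sketch, with a random stand-in for the emotion steering vector (the helper name and shapes are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 512

# Hypothetical emotion steering vector (random stand-in values).
emotion_vec = rng.normal(size=hidden_dim)

def matched_random_direction(vec, rng):
    """Random direction rescaled to the same L2 norm as `vec`.

    If behavior shifts just as much under this control as under the
    emotion vector, the effect is attributable to generic activation
    perturbation rather than to emotional content.
    """
    r = rng.normal(size=vec.shape)
    return r * (np.linalg.norm(vec) / np.linalg.norm(r))

control_vec = matched_random_direction(emotion_vec, rng)
print(np.isclose(np.linalg.norm(control_vec), np.linalg.norm(emotion_vec)))
```

Because norm is the only property matched, any remaining behavioral difference between the emotion vector and this control isolates the direction-specific (i.e., emotion-specific) component of the effect.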
minor comments (2)
- [Abstract] Abstract: The reference to 'various architecture and modalities' is imprecise given the focus on text-based SLMs; clarify exactly which models were tested and whether any multimodal variants were included.
- [Results figures] Figure 2 (or equivalent results visualization): Captions should explicitly state the steering vector magnitude, emotion categories, and number of trials per condition to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. These points help clarify the requirements for stronger causal attribution and methodological transparency in our study of emotion-sensitive decision making in small language models. We respond to each major comment below and will revise the manuscript accordingly.
read point-by-point responses
Referee: [§4] §4 (Experimental Setup and Results): The central claim that emotional perturbations cause the observed shifts in strategic choices (e.g., in Diplomacy and StarCraft templates) requires isolating emotion-specific effects. The methods derive steering vectors from emotion texts but lack reported controls such as neutral steering vectors, magnitude-matched random directions, or non-emotional text-derived vectors. Without these, systematic changes could stem from generic activation perturbations rather than induced emotional states, directly undermining attribution of instability and misalignment to emotion (as flagged in the stress-test concern).
Authors: We agree that additional controls are essential to isolate emotion-specific effects from generic activation perturbations. The original experiments used steering vectors derived exclusively from crowd-validated emotion-eliciting texts, but we did not report neutral or random-direction baselines. In the revised manuscript, we will add these controls—neutral steering vectors from non-emotional texts and magnitude-matched random directions—and present the comparative results in §4. This will strengthen the attribution of the observed instabilities and misalignments to the induced emotional states. revision: yes
Referee: [§4.2] §4.2 (Model Families and Statistical Analysis): The abstract claims systematic effects 'across multiple model families' yet immediately qualifies them as unstable; however, the full methods do not detail statistical tests, data exclusion rules, or explicit baseline comparisons. This makes it impossible to judge whether the steering vectors produce cleanly isolated effects or if results are driven by model sensitivity to any perturbation.
Authors: We acknowledge the need for explicit statistical details to support claims of systematic effects. The revised §4.2 will include descriptions of the statistical tests used (e.g., chi-squared tests on choice distributions and paired comparisons to baselines), data exclusion criteria (such as filtering responses that violate game rules or are incomplete), and direct baseline comparisons to unsteered model outputs. These additions will clarify the robustness of effects across model families while retaining the reported qualification regarding instability. revision: yes
Circularity Check
No significant circularity: empirical evaluation of external intervention effects
full rationale
The paper conducts an empirical study measuring behavioral changes in SLM agents after applying activation steering vectors derived from external crowd-validated emotion texts. The central results rely on downstream game outcomes in established templates (Diplomacy, StarCraft II, personas) rather than any derivation that reduces predictions or claims to parameters fitted from the target data itself. No equations, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the intervention is constructed independently of the measured strategic choices, and benchmarks are not tailored to force the observed instability or misalignment. This is a standard non-circular empirical design.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Activation steering vectors derived from crowd-validated texts produce isolated and transferable emotional state changes in transformer representations.
- domain assumption The chosen strategic scenarios from Diplomacy, StarCraft II, and real-world personas form a representative sample of cooperative and competitive incentives under complete and incomplete information.
discussion (0)