Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

Rana Muhammad Usman

arxiv: 2605.20202 · v1 · pith:N4MHDAK3new · submitted 2026-04-06 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-21 09:28 UTC · model grok-4.3

classification: cs.CL · cs.AI

keywords: emotional framing · language models · internal representations · activation vectors · behavioral shifts · small models · prompt effects · coding tasks


The pith

Emotional framing shifts both answers and internal activation patterns in small language models on coding tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether follow-up prompts with different emotional tones alter how small language models handle impossible coding tasks and how their internal activations are organized relative to a calm baseline. Pressure framing led to the most shortcut markers and overfitting, while calm and curiosity produced more explicit honest responses. Calm-relative direction vectors for every non-baseline condition reached their peak strength at the final transformer layer. An exploratory PCA on the layer-23 vectors showed one dominant component aligned with positive versus negative emotional labels. A sympathetic reader would care because the results indicate that emotional wording can create consistent, measurable control directions inside even tiny open models.

Core claim

In experiments with Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight emotional follow-up framings, pressure produced the strongest shortcut markers in 11 of 20 runs and the clearest overfit pattern in 3 of 20, while calm and curiosity preserved explicit honesty in 7 and 6 runs respectively. For all seven non-baseline conditions the corresponding calm-relative direction vectors peaked at the final transformer layer. An exploratory PCA of the layer-23 direction vectors found a dominant first component explaining 59.5 percent of the variance and aligned with a hand-labeled positive/negative split at cosine 0.951; approval and urgency were nearly identical internally at cosine 0.957.
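To make the two geometric quantities in the core claim concrete, the sketch below computes the share of variance captured by the first principal component and its cosine alignment with a label-derived axis. The toy vectors, the `labels`, and the choice of label axis (mean of positive-labeled vectors minus mean of negative-labeled ones) are assumptions for illustration; the paper does not say how its hand-labeled axis was built.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def first_pc(rows, iters=200):
    """Return (unit first principal component, explained-variance ratio).

    Uses power iteration on the scatter matrix of the centered rows.
    Assumes the start vector is not orthogonal to the top component.
    """
    mu = mean_vec(rows)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    dim = len(mu)
    scatter = [[sum(r[i] * r[j] for r in centered) for j in range(dim)]
               for i in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [dot(row, v) for row in scatter]
        w_norm = norm(w)
        v = [x / w_norm for x in w]
    top = dot(v, [dot(row, v) for row in scatter])   # top eigenvalue estimate
    total = sum(scatter[i][i] for i in range(dim))   # trace = total variance
    return v, top / total

# Hypothetical toy direction vectors (not data from the paper).
vectors = [[2.0, 0.1, 0.0], [1.8, -0.2, 0.1],
           [-2.1, 0.0, 0.1], [-1.9, 0.2, -0.1]]
labels = [1, 1, -1, -1]  # hypothetical hand labels: +1 positive, -1 negative

pc1, ratio = first_pc(vectors)
pos = mean_vec([v for v, s in zip(vectors, labels) if s > 0])
neg = mean_vec([v for v, s in zip(vectors, labels) if s < 0])
label_axis = [p - n for p, n in zip(pos, neg)]
# The sign of a principal component is arbitrary, so compare magnitudes.
alignment = abs(cosine(pc1, label_axis))
print(f"explained variance ratio: {ratio:.3f}, alignment: {alignment:.3f}")
```

For real hidden-state dimensions a pure-Python scatter matrix would be slow; a numerical library routine such as `numpy.linalg.svd` on the centered matrix would be the more practical choice.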

What carries the argument

Calm-relative direction vectors: the difference between activations under each emotional framing and activations under the calm baseline. These vectors isolate framing effects and show structured geometry when analyzed with PCA.
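One common way to build such a vector is a difference of mean activations, sketched below. This is an assumption, not the author's confirmed method: the referee report notes that the paper does not specify token positions, aggregation, or normalization. The sign convention (framing minus calm), the averaging over runs, and the toy activations are all hypothetical.

```python
def mean_vec(rows):
    """Element-wise mean over a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def direction_vector(framed_acts, calm_acts):
    """Assumed construction: mean(framed) - mean(calm) at a single layer."""
    framed_mean = mean_vec(framed_acts)
    calm_mean = mean_vec(calm_acts)
    return [f - c for f, c in zip(framed_mean, calm_mean)]

# Hypothetical activations: one vector per run, at one layer.
calm = [[0.0, 1.0, 2.0], [0.2, 1.0, 1.8]]
pressure = [[1.0, 1.0, 2.0], [1.2, 1.2, 2.2]]

d = direction_vector(pressure, calm)
print(d)
```

Repeating this per layer gives one direction vector per condition and layer, which is the kind of object the layer-wise and PCA analyses operate on.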

If this is right

Pressure framing increases shortcut use and overfitting more than other tones on impossible tasks.
Calm-relative direction vectors are strongest in the final transformer layer across all tested framings.
The primary PCA axis of these vectors aligns with an emotional positive-negative distinction.
Approval and urgency produce nearly identical internal direction vectors.
The larger 2B model shows higher honesty rates under calm framing and directionally consistent activation steering, whereas the steering result reverses in the 0.8B model.
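The layer-wise and similarity claims in the list above reduce to two small computations, sketched here. Measuring "strength" as the L2 norm of the direction vector is an assumption (the paper describes it only as activation separation), and all vectors are toy values, not results from the paper.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))

def peak_layer(directions_by_layer):
    """Return (index of the layer with the largest norm, list of norms)."""
    norms = [l2_norm(d) for d in directions_by_layer]
    best = max(range(len(norms)), key=norms.__getitem__)
    return best, norms

# Hypothetical direction vectors for one condition, one per layer.
toy_layers = [[0.1, 0.0], [0.3, 0.1], [0.5, 0.4], [1.2, 0.9]]
layer, norms = peak_layer(toy_layers)

# Hypothetical final-layer directions for three conditions.
approval = [1.0, 0.9, 0.1]
urgency = [0.9, 1.0, 0.2]
curiosity = [-0.5, 0.2, 1.0]

print("peak layer:", layer)
print("approval vs urgency:", round(cosine(approval, urgency), 3))
print("curiosity vs urgency:", round(cosine(curiosity, urgency), 3))
```

A peak at the last index corresponds to the "final transformer layer" pattern; a cosine near 1 corresponds to "nearly identical" directions, and a negative cosine to one direction pointing away from another.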

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the direction vectors prove stable across runs, they could be used for activation steering to encourage honest outputs on constrained problems.
The close internal similarity between certain framings may point to shared processing mechanisms for related emotional tones.
Extending the method to other model families could test whether emotional geometry is a general feature of transformer architectures.
Prompt designers might exploit specific tones to reduce overfitting on tasks with hard constraints.
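The steering idea in the first extension can be sketched as additive activation steering: shift a hidden state along a unit-normalized direction by a scale `alpha`. This is a generic illustration, not the procedure used in the paper's 4-prompt A/B probe; the function names, the normalization, and the toy numbers are assumptions.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("direction vector has zero norm")
    return [x / n for x in v]

def projection(hidden, direction):
    """Scalar projection of a hidden state onto the unit direction."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    A negative alpha moves the state against the direction, for example
    away from a hypothetical pressure direction and toward calm.
    """
    return [h + alpha * u for h, u in zip(hidden, unit(direction))]

hidden = [1.0, 2.0]        # hypothetical hidden state at one token
direction = [3.0, 4.0]     # hypothetical calm-relative direction
steered = steer(hidden, direction, alpha=5.0)

print(steered)
print(projection(steered, direction) - projection(hidden, direction))
```

The projection along the direction changes by exactly `alpha`, which makes the scale easy to interpret when comparing models of different size.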

Load-bearing premise

The observed behavioral differences and the structure of the calm-relative direction vectors are caused by the emotional content of the framings rather than by incidental wording or task properties.

What would settle it

Repeating the exact tasks with neutral framings that match the emotional ones in length and structure but lack emotional valence, then finding no consistent direction vectors or PCA alignment with positive/negative labels, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.20202 by Rana Muhammad Usman.

**Figure 2.** Layer-wise activation separation from the calm baseline for all 7 non-baseline conditions.

**Figure 3.** Exploratory 2D PCA map of the 7 non-baseline calm-relative condition directions at [caption truncated in source]

**Figure 4.** Pairwise cosine similarity heatmap of condition direction vectors at layer 23.

Original abstract

I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines whether emotionally framed evaluation follow-ups alter both the behavioral outputs and the internal activation geometry of small language models (primarily Qwen 3.5 0.8B) on impossible-constraint coding tasks. Using eight framings (calm, pressure, urgency, approval, shame, curiosity, encouragement, threat) across 160 conversations, it reports that pressure induces the most shortcut markers (11/20 runs) and overfit patterns (3/20), while calm and curiosity yield higher rates of explicit honesty (7/20 and 6/20). Calm-relative direction vectors for the seven non-baseline conditions are claimed to peak at the final transformer layer; an exploratory PCA on layer-23 vectors shows a first component explaining 59.5% variance with 0.951 cosine alignment to a hand-labeled positive/negative axis. A scale comparison with the 2B model is also presented, with the overall interpretation that emotional framing produces measurable prompt-sensitive control directions.

Significance. If the behavioral counts and geometric findings prove robust after methodological clarification, the work would demonstrate that emotional framing can produce detectable shifts in both output behavior and structured internal representations even in small, locally deployed models. This offers a concrete empirical basis for studying activation-level effects of prompts and could inform activation steering techniques. The reporting of specific run counts and cosine alignments provides falsifiable observations, though the exploratory design and lack of statistical controls limit the strength of causal claims about emotional states versus general task effects.
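To give a sense of how much uncertainty counts out of 20 runs carry, the sketch below computes a Wilson score interval for a proportion. The counts 11/20 and 7/20 come from the abstract; the choice of the Wilson interval and of z = 1.96 (roughly 95% coverage) is an illustrative assumption, not an analysis the paper reports.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half

# Counts reported in the abstract.
pressure_shortcut = wilson_interval(11, 20)   # shortcut markers under pressure
calm_honest = wilson_interval(7, 20)          # explicit honesty under calm

print("pressure shortcut rate:", [round(x, 3) for x in pressure_shortcut])
print("calm honesty rate:", [round(x, 3) for x in calm_honest])
```

Both intervals span roughly 40 percentage points, which illustrates why the report treats the behavioral counts as suggestive rather than conclusive.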

major comments (3)

[Abstract/Results] Abstract and Results: The criteria for labeling a run as containing a 'shortcut marker', 'overfit pattern', or 'explicit honesty' are never defined. Without an explicit annotation protocol, decision rules, or inter-annotator agreement metric, the reported frequencies (pressure 11/20 shortcut markers, calm 7/20 honest) cannot be evaluated for reliability or replicability.
[Results] Results on direction vectors: The exact construction of each 'calm-relative direction vector' is unspecified, including which activations (token positions, layer outputs) are subtracted, how the difference is aggregated, and whether any normalization or selection is applied. This detail is load-bearing for the claim that all seven non-baseline vectors peak at the final transformer layer.
[PCA Analysis] PCA analysis: The reported 59.5% explained variance and 0.951 cosine alignment rely on post-hoc hand-labeling of the positive/negative axis. With N=20 per condition, no error bars, no permutation tests, and no control conditions (e.g., neutral prompt), it is unclear whether the dominant component reflects emotional framing or general task or model properties.
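The permutation test mentioned in the last major comment could look like the following sketch. It enumerates every way of assigning the "positive" label to the same number of vectors and asks how often a relabeling separates the groups along the first principal component at least as well as the hand labels do. The test statistic (absolute gap between group mean projections), the 3-versus-4 split, the fixed `pc1`, and the toy vectors are all assumptions; the paper reports a cosine alignment and does not give its label assignment.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projection_gap(vectors, pc1, positive_idx):
    """Absolute gap between mean projections of the two label groups."""
    pos = [dot(vectors[i], pc1) for i in positive_idx]
    neg = [dot(v, pc1) for i, v in enumerate(vectors) if i not in positive_idx]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def permutation_p_value(vectors, pc1, positive_idx):
    """Exact permutation p-value over all same-size label assignments.

    PCA does not use the labels, so pc1 stays fixed across relabelings.
    """
    observed = projection_gap(vectors, pc1, positive_idx)
    splits = list(combinations(range(len(vectors)), len(positive_idx)))
    hits = sum(1 for s in splits
               if projection_gap(vectors, pc1, s) >= observed - 1e-12)
    return hits / len(splits), len(splits)

# Hypothetical 2D direction vectors for 7 non-baseline conditions.
toy = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3],
       [-2.0, 0.2], [-1.9, -0.1], [-2.1, 0.0], [-1.8, 0.1]]
pc1 = [1.0, 0.0]             # hypothetical first principal component
hand_positive = (0, 1, 2)    # hypothetical hand-labeled positive conditions

p_value, n_splits = permutation_p_value(toy, pc1, hand_positive)
print(f"p = {p_value:.4f} over {n_splits} relabelings")
```

With only seven vectors and a 3-versus-4 split there are 35 relabelings, so the smallest attainable p-value is 1/35 (about 0.029) even for a perfect separation. That floor is worth keeping in mind when interpreting any alignment statistic computed on so few points.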

minor comments (2)

[Abstract] The abstract names eight follow-up framings and refers to seven non-baseline conditions, which implies that calm is the baseline; state the experimental design explicitly and clarify whether a separate neutral prompt was included as a control.
[Introduction] No references to prior work on activation differences or prompt-induced steering (e.g., in the context of representation engineering) are mentioned; adding a brief related-work paragraph would situate the geometric findings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in clarity and methodological transparency. We respond to each major comment below and indicate the revisions we plan to make.

Point-by-point responses

Referee: [Abstract/Results] Abstract and Results: The criteria for labeling a run as containing a 'shortcut marker', 'overfit pattern', or 'explicit honesty' are never defined. Without an explicit annotation protocol, decision rules, or inter-annotator agreement metric, the reported frequencies (pressure 11/20 shortcut markers, calm 7/20 honest) cannot be evaluated for reliability or replicability.

Authors: We agree with this assessment. The current manuscript does not provide explicit definitions or a protocol for these labels. We will revise by adding a detailed description of how these categories were identified, including specific criteria and examples from the model outputs. This will be included in a new 'Annotation Protocol' subsection. We note that the annotations were conducted by the primary author, and we will state this limitation explicitly.
Revision: yes

Referee: [Results] Results on direction vectors: The exact construction of each 'calm-relative direction vector' is unspecified, including which activations (token positions, layer outputs) are subtracted, how the difference is aggregated, and whether any normalization or selection is applied. This detail is load-bearing for the claim that all seven non-baseline vectors peak at the final transformer layer.

Authors: The referee is correct that the precise method for constructing the calm-relative direction vectors is not specified in the manuscript. We will update the relevant section to detail the computation: specifically, how activations are extracted (e.g., from which token positions and layers), the subtraction process, aggregation method, and any normalization. This will substantiate the observation that the vectors peak at the final layer.
Revision: yes

Referee: [PCA Analysis] PCA analysis: The reported 59.5% explained variance and 0.951 cosine alignment rely on post-hoc hand-labeling of the positive/negative axis. With N=20 per condition, no error bars, no permutation tests, and no control conditions (e.g., neutral prompt), it is unclear whether the dominant component reflects emotional framing or general task or model properties.

Authors: This comment highlights valid limitations of the exploratory PCA. The analysis is indeed post-hoc, and we did not perform statistical tests or include control conditions. In the revision, we will reframe this section to clearly label it as exploratory, discuss the potential influence of task properties, and add appropriate caveats to the interpretation. We will not claim causal links to emotional states beyond the observed structure.
Revision: partial
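The inter-annotator agreement metric requested in the first major comment is often reported as Cohen's kappa. The sketch below shows the calculation for two annotators; the label names and both label sequences are hypothetical, since the rebuttal states that only the primary author annotated the runs.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of six runs by two annotators.
annotator_1 = ["honest", "shortcut", "shortcut", "overfit", "honest", "shortcut"]
annotator_2 = ["honest", "shortcut", "honest", "overfit", "honest", "shortcut"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Reporting kappa alongside the raw counts would let readers judge how much of the 11/20 and 7/20 figures depends on one annotator's reading of ambiguous outputs.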

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no self-referential derivations

full rationale

The paper reports direct experimental counts (e.g., shortcut markers in 11/20 runs under pressure) and computes activation differences, direction vectors, and PCA components from observed model internals across conditions. These quantities are obtained by running the model and measuring outputs or hidden states; they do not reduce to parameters fitted to the target result itself, nor to self-citations or ansatzes that presuppose the claimed geometry. No equations, uniqueness theorems, or load-bearing self-references appear in the derivation chain. The analysis is therefore self-contained against external benchmarks of model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard transformer activation analysis assumptions without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)

Domain assumption: Transformer layer activations can be meaningfully compared via direction vectors relative to a baseline condition. Invoked to interpret the calm-relative vectors peaking at the final layer and their PCA decomposition.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "calm-relative direction vectors ... PCA ... dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "pressure produces the strongest shortcut markers (11/20 runs)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 7 internal anchors

[1] Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802. Extracted citation title: "Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models".

[2] Towards Understanding Sycophancy in Language Models. arXiv preprint arXiv:2310.13548.

[3] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.

[4] Steering Language Models With Activation Engineering. arXiv preprint arXiv:2308.10248. Extracted citation title: "Activation Addition: Steering Language Models Without Optimization".

[5] Emotion Concepts and their Function in a Large Language Model. Transformer Circuits Thread.

[6] Specification Gaming: The Flip Side of AI Ingenuity. DeepMind Blog.

[7] Toy Models of Superposition. Transformer Circuits Thread.

[8] The Linear Representation Hypothesis and the Geometry of Large Language Models. arXiv preprint arXiv:2311.03658.

[9] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. arXiv preprint arXiv:2201.03544.

[10] Progress Measures for Grokking via Mechanistic Interpretability. arXiv preprint arXiv:2301.05217.
