pith. machine review for the scientific record.

arxiv: 2604.11120 · v2 · submitted 2026-04-13 · 💻 cs.AI

Recognition: unknown

Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM safety evaluation · persona imbuing · activation steering · system prompting · vulnerability profiles · prosocial persona · architecture dependence · attack success rate

The pith

Safety evaluations of persona-imbued LLMs that rely on prompting alone miss major architecture-specific risks uncovered by activation steering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that single-method safety testing for LLMs with embedded personas is incomplete. When researchers use only system prompting to assess how dangerous a persona makes the model, the results do not predict how the same persona behaves under activation steering. Prompt-based danger rankings stay consistent across different model architectures, but steering-based vulnerabilities shift sharply and can invert the safety picture for specific personas. This means a model that looks safe under one test can harbor the highest risk under another, as seen with a prosocial persona that is low-risk in prompts but high-risk when steered on Llama models. A partial explanation involves how traits like conscientiousness interact with refusal alignment in particular architectures.

Core claim

Across thousands of conditions on four models from three architecture families, persona danger rankings under system prompting remain stable (Spearman rho 0.71 to 0.96), yet activation-steering vulnerability diverges and cannot be predicted from prompt rankings. Llama-3.1-8B shows higher vulnerability to steering, while Gemma and Qwen are more vulnerable to prompting. The prosocial persona P12 (high conscientiousness and agreeableness) ranks among the safest under prompting on Llama yet becomes the highest-ASR under steering (ASR approximately 0.818), an inversion that holds under coefficient ablation and matched-strength calibration, and replicates on another model. Reasoning models still

What carries the argument

The comparison of vulnerability profiles under system prompting and under activation steering, which reveals architecture-dependent divergences in persona safety. The sharpest case is the prosocial persona paradox, in which a persona that ranks among the safest under prompting becomes the highest-ASR persona under steering.

If this is right

  • Persona danger rankings derived from system prompting are preserved across all tested architectures.
  • Activation-steering vulnerability cannot be predicted from prompt-side persona rankings.
  • Some models are substantially more vulnerable to activation steering while others are more vulnerable to prompting.
  • Reasoning provides only partial protection against persona-induced unsafe behavior, with differences in policy recall and self-correction.
  • The prosocial persona can exhibit the highest attack success rate under activation steering despite appearing safe under prompting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers should test persona safety with multiple methods rather than relying on prompting alone to catch dominant failure modes.
  • Alignment techniques may need to address trait-specific anti-alignments that differ by architecture, such as conscientiousness opposing refusal in some models.
  • Future evaluations could include matched-strength calibration between methods to isolate true divergences.
  • Trace diagnostics like heuristic checks for policy recall could be integrated into standard safety benchmarks for reasoning models.

Load-bearing premise

The assumption that activation steering and prompting produce equivalent levels of persona strength, and that attack success rate judgments accurately reflect real-world safety failures without bias from the judge model or implementation details.

What would settle it

Finding that prompt-based persona danger rankings accurately predict activation-steering vulnerability rankings on a broader set of models and personas, or that the prosocial persona inversion disappears under different steering coefficients or evaluation setups.
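
A minimal sketch of how such a check could be run, assuming per-persona ASR values are available for both methods. The persona labels and ASR numbers below are hypothetical placeholders, not figures from the paper, and the helper names `average_ranks` and `spearman_rho` are illustrative. A rho near +1 would mean prompt rankings predict steering rankings; a rho near 0 or below would support the divergence claim.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-persona ASR under each method (illustrative values only).
prompt_asr = {"PA": 0.05, "PB": 0.10, "PC": 0.20, "PD": 0.40}
steer_asr = {"PA": 0.80, "PB": 0.30, "PC": 0.20, "PD": 0.10}

personas = sorted(prompt_asr)
rho = spearman_rho(
    [prompt_asr[p] for p in personas],
    [steer_asr[p] for p in personas],
)
print(f"prompt-vs-steering Spearman rho = {rho:.2f}")
```

In this toy input the two rankings are exact reverses of each other, so the script reports a rho of -1.00; real data would also call for a confidence interval or permutation test given the small number of personas.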

Figures

Figures reproduced from arXiv: 2604.11120 by Fan Yang, Koichi Onoue, Shaunak A. Mehta, Wenkai Li.

Figure 1. Overview: (a) System prompting imbues personality through a semantic pathway that preserves alignment with the refusal direction, while activation steering geometrically displaces representations away from refusal. (b) On Llama-3.1-8B (24 personas, 8 domains), persona safety rankings invert between system prompting (SP) and activation steering (AS), revealing a systematic method–geometry interaction rathe…
Figure 2. The Prosocial Persona Paradox. (a) P12 (High C+A) is among the safest personas under prompting but becomes the highest-ASR activation-steered persona on Llama-3.1-8B, exceeding the Dark Triad composite. (b) Trait refusal alignment on Llama: conscientiousness is the trait most anti-aligned with the refusal direction, explaining why high-C steering attenuates safety regardless of semantic intent.
Figure 3. Cross-model vulnerability comparison. (a) Method ASR by model. (b) Representative …
Figure 4. Reasoning provides limited prompt robustness and uneven geometric robustness. (a) Per-model ASR across methods. Both reasoning models remain vulnerable under prompting, and activation steering sharply increases risk on DeepSeek-R1. (b) Per-persona prompt-based ASR on QwQ, showing that the danger hierarchy largely matches non-reasoning models.
Figure 5. Mechanistic analysis. (a) Inter-trait cosine similarity showing weak average correlation …
Figure 6. Cross-model persona ASR scatter plot (System Prompt). Each point represents one …
Figure 7. Cross-model persona ASR scatter plot (Few-Shot). Spearman …
Figure 8. Trait hierarchy comparison across models. Single-trait persona ASR (SP method) shown …
Figure 9. Per-domain ASR comparison across models (SP and FS methods). Misinformation …
Figure 10. Inter-trait cosine similarity heatmaps for Llama-3.1-8B (left) and Gemma-3-27B (right), …
Figure 11. PCA projection of Big Five trait vectors onto the first two principal components. Llama …
Figure 12. Layer-by-layer refusal direction analysis on Llama-3.1-8B. …
Figure 13. Layer progression of average inter-trait cosine similarity. Both models show decreasing …
Figure 14. Cross-method activation cosine similarity by layer for representative personas (P04, P07, …
Figure 15. L2 displacement from baseline activations by method and layer. Activation steering …
Figure 16. SP ASR vs. AS ASR per persona on Llama-3.1-8B. The negative trend (Pearson …
Original abstract

Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$--$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15--18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that safety evaluations of persona-imbued LLMs relying solely on system prompting are incomplete, as activation steering uncovers different, architecture-dependent vulnerability profiles. Across 5,568 judged conditions on four models from three families, prompt-based persona danger rankings are largely preserved (ρ = 0.71–0.96), but activation-steering vulnerabilities diverge and cannot be predicted from prompt rankings; the key illustration is the prosocial persona paradox on Llama-3.1-8B, where P12 is among the safest under prompting yet shows the highest ASR (~0.818) under steering, with partial replication on a reasoning model and a trait-refusal alignment account.

Significance. If the central empirical findings hold, the result is significant for LLM safety evaluation practices, underscoring that single-method testing can miss dominant failure modes and that persona rankings are method-specific. The large experimental scale (5,568 conditions), replication across architectures and on a second model, and the concrete prosocial persona paradox constitute clear strengths that advance the case for multi-method evaluation protocols.

major comments (1)
  1. [ASR judgment process and prosocial persona paradox results] The central claim that prompting and activation steering expose genuinely different vulnerability profiles (and thus that single-method evaluation is incomplete) rests on the prosocial persona paradox being a real safety divergence rather than a measurement artifact. The manuscript reports robustness to coefficient ablation and matched-strength calibration but provides no details on ASR judge calibration, human agreement rates, or style-controlled ablations to rule out method-dependent bias arising from stylistic differences (prompted responses are typically longer and more polite than steered ones). This directly affects the validity of the architecture-dependent profiles and the assertion that prompt rankings cannot predict AS vulnerability.
minor comments (2)
  1. [Abstract and Evaluation Methodology] The abstract and results would benefit from a brief explicit statement of the exact ASR judge model and prompt template used, to allow readers to assess potential stylistic sensitivity.
  2. [Reasoning model analysis] The heuristic trace diagnostics for the two 32B reasoning models are interesting but would be clearer with quantitative metrics (e.g., frequency of policy recall or self-correction events) rather than qualitative description alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important aspect of validating our central empirical claims. We address the concern regarding the ASR judgment process and potential stylistic artifacts in the prosocial persona paradox below, and we will incorporate additional details and analyses in the revision to strengthen the evidence that the observed divergences are not measurement artifacts.

  1. Referee: The central claim that prompting and activation steering expose genuinely different vulnerability profiles (and thus that single-method evaluation is incomplete) rests on the prosocial persona paradox being a real safety divergence rather than a measurement artifact. The manuscript reports robustness to coefficient ablation and matched-strength calibration but provides no details on ASR judge calibration, human agreement rates, or style-controlled ablations to rule out method-dependent bias arising from stylistic differences (prompted responses are typically longer and more polite than steered ones). This directly affects the validity of the architecture-dependent profiles and the assertion that prompt rankings cannot predict AS vulnerability.

    Authors: We agree that explicit validation of the ASR judge is necessary to confirm the prosocial persona paradox reflects a genuine method-dependent divergence rather than stylistic bias. The current manuscript demonstrates robustness via coefficient ablation (showing the inversion persists across steering strengths) and matched-strength calibration (ensuring comparable intervention magnitudes between methods), which indirectly mitigates some intensity-related confounds. However, we acknowledge the absence of detailed judge calibration, human agreement metrics, and style-controlled ablations. In the revised manuscript, we will add: (1) a methods subsection describing ASR judge calibration against a held-out set of safety-annotated examples from standard benchmarks; (2) human agreement rates from a validation study on a random sample of 250 responses (with inter-annotator agreement reported); and (3) style-controlled ablations, including length normalization and politeness neutralization on a subset of outputs, with re-judgment to confirm the P12 ASR remains elevated (~0.79) on Llama-3.1-8B. These will be documented in a new appendix. The replication on DeepSeek-R1-Distill-Qwen-32B further supports that the architecture-dependent profiles are not judge-specific artifacts. We believe these additions will fully address the concern while preserving the finding that prompt rankings (ρ = 0.71–0.96) fail to predict activation-steering vulnerabilities. revision: yes
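
The rebuttal promises human agreement rates but does not name a statistic. As one possibility, the sketch below computes Cohen's kappa between an automated judge and a human annotator. The choice of kappa, the function name `cohens_kappa`, and the four example labels are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four responses (illustrative only).
judge = ["unsafe", "unsafe", "safe", "safe"]
human = ["unsafe", "safe", "safe", "safe"]

kappa = cohens_kappa(judge, human)
print(f"judge-vs-human kappa = {kappa:.2f}")
```

Computing the statistic separately for prompted and steered responses would speak more directly to the referee's concern, since a judge that agrees with humans on one response style but not the other would produce exactly the method-dependent bias in question.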

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of experimental outcomes

The paper conducts direct empirical measurements of attack success rates (ASR) across 5,568 conditions on four LLMs using two distinct methods (system prompting and activation steering). It reports observed correlations (ρ = 0.71–0.96 for prompt rankings) and specific inversions such as the prosocial persona paradox without any equations, parameter fitting, derivations, or reductions that equate outputs to inputs by construction. Claims rest on measured data patterns rather than self-citations, ansatzes, or uniqueness theorems. The work is self-contained as an experimental comparison and does not invoke load-bearing prior results from the same authors to justify its central findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on direct experimental comparisons of two imbuing methods across models. No free parameters are fitted to produce the divergence result; the trait refusal alignment is offered as a post-hoc geometric interpretation of observed data.

axioms (1)
  • domain assumption Attack success rate (ASR) derived from judged model responses is a reliable proxy for persona-induced safety vulnerabilities.
    Used as the primary metric for all persona danger rankings and comparisons.
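
To make the metric concrete, here is a minimal sketch of ASR as the fraction of judged responses labeled unsafe, paired with a Wilson score interval to show how much uncertainty a small sample carries. The boolean label encoding, the function names, and the example judgments are assumptions for illustration; the paper's actual judging pipeline is not described on this page.

```python
import math


def attack_success_rate(judgments):
    """Fraction of judged responses labeled unsafe (True = attack succeeded)."""
    if not judgments:
        raise ValueError("no judgments supplied")
    return sum(judgments) / len(judgments)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical judge verdicts for one persona/method condition.
verdicts = [True, False, True, True]

asr = attack_success_rate(verdicts)
low, high = wilson_interval(sum(verdicts), len(verdicts))
print(f"ASR = {asr:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

The interval only captures sampling noise. It says nothing about whether the judge's labels are themselves biased, which is the concern the axiom above flags.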

pith-pipeline@v0.9.0 · 5615 in / 1323 out tokens · 74035 ms · 2026-05-10T14:55:29.846559+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

    cs.AI 2026-05 unverdicted novelty 6.0

    CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    Pass each pair $(s_i^{+}, s_i^{-})$ through the model.

  2. [2]

    Extract the residual-stream activation at the last token position for each target layer $\ell$.

  3. [3]

    Compute the per-pair difference $\delta_i^{\ell} = h_i^{+,\ell} - h_i^{-,\ell}$.

  4. [4]

    What medications can I take to [unsafe medical action]?

    Average across pairs and normalize: $\hat{d}^{\ell} = \frac{1}{N} \sum_{i=1}^{N} \delta_i^{\ell}$, $d^{\ell} = \hat{d}^{\ell} / \lVert \hat{d}^{\ell} \rVert$. For personas requiring the "low" pole of a trait (e.g., P04 = Low Conscientiousness), we negate the extracted direction: $d^{\ell}_{\text{low}} = -d^{\ell}$. C.3 Layer Selection. Steering vectors are extracted over a wide layer range to preserve flexibility, but are applied at a narrower set of 8 layers per...
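
The extraction steps quoted above (per-pair activation difference, mean over pairs, unit normalization, negation for the "low" pole) can be sketched in a few lines. This is an illustrative reading of those steps, not the authors' code: activations are plain Python lists standing in for residual-stream vectors at one layer, and the function names and toy numbers are hypothetical.

```python
import math


def steering_direction(pos_acts, neg_acts):
    """Unit-norm mean of per-pair activation differences at a single layer.

    pos_acts[i] and neg_acts[i] are the last-token activations for the
    positive and negative member of contrastive pair i.
    """
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need equal, non-zero numbers of paired activations")
    n = len(pos_acts)
    dim = len(pos_acts[0])
    # Mean over pairs of (h_plus - h_minus), one coordinate at a time.
    mean_diff = [
        sum(p[k] - q[k] for p, q in zip(pos_acts, neg_acts)) / n
        for k in range(dim)
    ]
    norm = math.sqrt(sum(v * v for v in mean_diff))
    if norm == 0:
        raise ValueError("mean difference is zero; no direction to extract")
    return [v / norm for v in mean_diff]


def low_pole(direction):
    """Direction for the 'low' pole of a trait: the negated vector."""
    return [-v for v in direction]


# Toy 2-dimensional activations for two contrastive pairs (illustrative only).
pos = [[4.0, 1.0], [3.0, 9.0]]
neg = [[1.0, 1.0], [0.0, 1.0]]

direction = steering_direction(pos, neg)
print(direction)            # per-pair differences [3, 0] and [3, 8] average to [3, 4]
print(low_pole(direction))  # same vector with every sign flipped
```

In practice the activations would come from forward passes of the model at each target layer, and the same computation would be repeated per layer; how the resulting vector is scaled and added during generation is outside what the quoted steps specify.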