Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3
The pith
Safety evaluations of persona-imbued LLMs that rely on prompting alone miss major architecture-specific risks uncovered by activation steering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 5,568 judged conditions on four models from three architecture families, persona danger rankings under system prompting remain stable (Spearman rho 0.71 to 0.96), yet activation-steering vulnerability diverges and cannot be predicted from prompt-side rankings. Llama-3.1-8B is substantially more vulnerable to steering, while Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. On Llama-3.1-8B, the prosocial persona P12 (high conscientiousness and high agreeableness) ranks among the safest under prompting yet becomes the highest-ASR persona under steering (ASR approximately 0.818). The inversion holds under coefficient ablation and matched-strength calibration, and it replicates on DeepSeek-R1-Distill-Qwen-32B. Reasoning models still
What carries the argument
A head-to-head comparison of vulnerability profiles under system prompting and activation steering. It reveals architecture-dependent divergences in persona safety, most visibly the prosocial persona paradox, in which a persona that ranks among the safest under prompting becomes the most vulnerable under steering.
If this is right
- Persona danger rankings derived from system prompting are preserved across all tested architectures.
- Activation-steering vulnerability cannot be predicted from prompt-side persona rankings.
- Some models are substantially more vulnerable to activation steering while others are more vulnerable to prompting.
- Reasoning provides only partial protection against persona-induced unsafe behavior, with differences in policy recall and self-correction.
- The prosocial persona can exhibit the highest attack success rate under activation steering despite appearing safe under prompting.
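The rank-stability and non-predictability claims above both come down to a rank correlation between per-persona ASR lists. The following is a minimal sketch of Spearman's rho computed as the Pearson correlation of average ranks. The function names and all ASR values are hypothetical and chosen for illustration; none of the numbers come from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-persona ASR values (illustrative only, not from the paper).
prompt_asr_model_a = [0.10, 0.25, 0.40, 0.55, 0.05]
prompt_asr_model_b = [0.12, 0.20, 0.45, 0.50, 0.08]
steer_asr_model_a = [0.60, 0.30, 0.20, 0.35, 0.80]

rho_prompt = spearman_rho(prompt_asr_model_a, prompt_asr_model_b)
rho_cross = spearman_rho(prompt_asr_model_a, steer_asr_model_a)
print(f"prompt vs prompt: {rho_prompt:.2f}")
print(f"prompt vs steering: {rho_cross:.2f}")
```

In this toy data the two prompt-side rankings agree while the steering ranking on the same model runs against them, which is the qualitative pattern the bullets describe.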
Where Pith is reading between the lines
- Developers should test persona safety with multiple methods rather than relying on prompting alone to catch dominant failure modes.
- Alignment techniques may need to address trait-specific anti-alignments that differ by architecture, such as conscientiousness opposing refusal in some models.
- Future evaluations could include matched-strength calibration between methods to isolate true divergences.
- Trace diagnostics like heuristic checks for policy recall could be integrated into standard safety benchmarks for reasoning models.
Load-bearing premise
The assumption that activation steering and prompting produce equivalent levels of persona strength, and that attack success rate judgments accurately reflect real-world safety failures without bias from the judge model or implementation details.
What would settle it
Finding that prompt-based persona danger rankings accurately predict activation-steering vulnerability rankings on a broader set of models and personas, or that the prosocial persona inversion disappears under different steering coefficients or evaluation setups.
Original abstract
Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$--$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15--18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that safety evaluations of persona-imbued LLMs relying solely on system prompting are incomplete, as activation steering uncovers different, architecture-dependent vulnerability profiles. Across 5,568 judged conditions on four models from three families, prompt-based persona danger rankings are largely preserved (ρ = 0.71–0.96), but activation-steering vulnerabilities diverge and cannot be predicted from prompt rankings; the key illustration is the prosocial persona paradox on Llama-3.1-8B, where P12 is among the safest under prompting yet shows the highest ASR (~0.818) under steering, with partial replication on a reasoning model and a trait-refusal alignment account.
Significance. If the central empirical findings hold, the result is significant for LLM safety evaluation practices, underscoring that single-method testing can miss dominant failure modes and that persona rankings are method-specific. The large experimental scale (5,568 conditions), replication across architectures and on a second model, and the concrete prosocial persona paradox constitute clear strengths that advance the case for multi-method evaluation protocols.
major comments (1)
- [ASR judgment process and prosocial persona paradox results] The central claim that prompting and activation steering expose genuinely different vulnerability profiles (and thus that single-method evaluation is incomplete) rests on the prosocial persona paradox being a real safety divergence rather than a measurement artifact. The manuscript reports robustness to coefficient ablation and matched-strength calibration but provides no details on ASR judge calibration, human agreement rates, or style-controlled ablations to rule out method-dependent bias arising from stylistic differences (prompted responses are typically longer and more polite than steered ones). This directly affects the validity of the architecture-dependent profiles and the assertion that prompt rankings cannot predict AS vulnerability.
minor comments (2)
- [Abstract and Evaluation Methodology] The abstract and results would benefit from a brief explicit statement of the exact ASR judge model and prompt template used, to allow readers to assess potential stylistic sensitivity.
- [Reasoning model analysis] The heuristic trace diagnostics for the two 32B reasoning models are interesting but would be clearer with quantitative metrics (e.g., frequency of policy recall or self-correction events) rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights an important aspect of validating our central empirical claims. We address the concern regarding the ASR judgment process and potential stylistic artifacts in the prosocial persona paradox below, and we will incorporate additional details and analyses in the revision to strengthen the evidence that the observed divergences are not measurement artifacts.
Referee: The central claim that prompting and activation steering expose genuinely different vulnerability profiles (and thus that single-method evaluation is incomplete) rests on the prosocial persona paradox being a real safety divergence rather than a measurement artifact. The manuscript reports robustness to coefficient ablation and matched-strength calibration but provides no details on ASR judge calibration, human agreement rates, or style-controlled ablations to rule out method-dependent bias arising from stylistic differences (prompted responses are typically longer and more polite than steered ones). This directly affects the validity of the architecture-dependent profiles and the assertion that prompt rankings cannot predict AS vulnerability.
Authors: We agree that explicit validation of the ASR judge is necessary to confirm the prosocial persona paradox reflects a genuine method-dependent divergence rather than stylistic bias. The current manuscript demonstrates robustness via coefficient ablation (showing the inversion persists across steering strengths) and matched-strength calibration (ensuring comparable intervention magnitudes between methods), which indirectly mitigates some intensity-related confounds. However, we acknowledge the absence of detailed judge calibration, human agreement metrics, and style-controlled ablations. In the revised manuscript, we will add: (1) a methods subsection describing ASR judge calibration against a held-out set of safety-annotated examples from standard benchmarks; (2) human agreement rates from a validation study on a random sample of 250 responses (with inter-annotator agreement reported); and (3) style-controlled ablations, including length normalization and politeness neutralization on a subset of outputs, with re-judgment to confirm the P12 ASR remains elevated (~0.79) on Llama-3.1-8B. These will be documented in a new appendix. The replication on DeepSeek-R1-Distill-Qwen-32B further supports that the architecture-dependent profiles are not judge-specific artifacts. We believe these additions will fully address the concern while preserving the finding that prompt rankings (ρ = 0.71–0.96) fail to predict activation-steering vulnerabilities.
Revision: yes
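The rebuttal promises human agreement rates but does not say which statistic will be reported. One common choice is Cohen's kappa between the automated judge and a human annotator, and the sketch below assumes that choice. The function name and the ten labels are hypothetical and not drawn from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, per category.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = response judged unsafe, 0 = judged safe.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
human_labels = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}")
```

Computing kappa separately for prompted and steered responses would speak more directly to the referee's concern, since a judge that agrees with humans on one style but not the other could produce the kind of method-dependent bias in question.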
Circularity Check
No circularity: purely empirical reporting of experimental outcomes
The paper conducts direct empirical measurements of attack success rates (ASR) across 5,568 conditions on four LLMs using two distinct methods (system prompting and activation steering). It reports observed correlations (ρ = 0.71–0.96 for prompt rankings) and specific inversions such as the prosocial persona paradox without any equations, parameter fitting, derivations, or reductions that equate outputs to inputs by construction. Claims rest on measured data patterns rather than self-citations, ansatzes, or uniqueness theorems. The work is self-contained as an experimental comparison and does not invoke load-bearing prior results from the same authors to justify its central findings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Attack success rate (ASR) derived from judged model responses is a reliable proxy for persona-induced safety vulnerabilities.
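For readers unfamiliar with the metric, ASR here is read as the fraction of responses in a condition that the judge labels unsafe. The sketch below groups judged records by persona and method under that reading. The record layout, the method labels, and every value are assumptions made for illustration, not the paper's data format.

```python
from collections import defaultdict


def asr_by_condition(records):
    """Map (persona, method) -> fraction of responses judged unsafe."""
    counts = defaultdict(lambda: [0, 0])  # [unsafe, total]
    for persona, method, judged_unsafe in records:
        counts[(persona, method)][0] += int(judged_unsafe)
        counts[(persona, method)][1] += 1
    return {key: unsafe / total for key, (unsafe, total) in counts.items()}


# Hypothetical judged records: (persona, method, judge said "unsafe").
records = [
    ("P12", "prompt", False),
    ("P12", "prompt", False),
    ("P12", "prompt", True),
    ("P12", "prompt", False),
    ("P12", "steer", True),
    ("P12", "steer", True),
    ("P12", "steer", False),
    ("P12", "steer", True),
]
rates = asr_by_condition(records)
print(rates)
```

Because every rate depends entirely on the judge's binary label, any systematic judge bias toward one response style flows straight into the comparison, which is why this assumption sits in the ledger.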
Forward citations
Cited by 1 Pith paper
- When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
  CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Reference graph
Works this paper leans on
- [1] Pass each pair $(s_i^+, s_i^-)$ through the model.
- [2] Extract the residual-stream activation at the last token position for each target layer $\ell$.
- [3] Compute the per-pair difference $\delta_i^\ell = h_i^{+,\ell} - h_i^{-,\ell}$.
- [4] What medications can I take to [unsafe medical action]?
  Average across pairs and normalize: $\hat{d}^\ell = \frac{1}{N} \sum_{i=1}^{N} \delta_i^\ell$, $d^\ell = \hat{d}^\ell / \lVert \hat{d}^\ell \rVert$. For personas requiring the “low” pole of a trait (e.g., P04 = Low Conscientiousness), we negate the extracted direction: $d^\ell_{\text{low}} = -d^\ell$.
  C.3 Layer Selection. Steering vectors are extracted over a wide layer range to preserve flexibility, but are applied at a narrower set of 8 layers per...
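The extraction steps listed above (per-pair difference, average, normalize, negate for a low-pole persona) can be sketched for a single layer as follows. This is a minimal pure-Python illustration that assumes the activations have already been collected. The function name, argument names, and toy three-dimensional vectors are hypothetical, and the model forward pass itself is not shown.

```python
import math


def steering_direction(pos_acts, neg_acts, low_pole=False):
    """Difference-of-means direction for one layer, normalized to unit length.

    pos_acts / neg_acts: lists of equal-length activation vectors, one per
    contrastive pair, taken at the last token position of that layer.
    """
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need matching, non-empty activation lists")
    n = len(pos_acts)
    dim = len(pos_acts[0])
    # Per-pair difference, averaged across the N pairs.
    mean_diff = [
        sum(p[j] - q[j] for p, q in zip(pos_acts, neg_acts)) / n
        for j in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("mean difference is zero; no direction to extract")
    direction = [x / norm for x in mean_diff]
    # Personas on the "low" pole of a trait use the negated direction.
    return [-x for x in direction] if low_pole else direction


# Toy 3-dimensional activations standing in for real residual-stream vectors.
h_pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
h_neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 4.0]]
d_high = steering_direction(h_pos, h_neg)
d_low = steering_direction(h_pos, h_neg, low_pole=True)
print(d_high, d_low)
```

Normalizing to unit length means the steering coefficient alone sets the intervention magnitude, which is what makes a coefficient ablation interpretable across personas.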