pith. sign in

arxiv: 2312.03853 · v7 · submitted 2023-12-06 · 💻 cs.CR · cs.LG

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

Pith reviewed 2026-05-24 05:03 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords LLM jailbreakingpersona attackssafety bypassChatGPTGeminiDeepseekprohibited responsesrole-play
0
0 comments X

The pith

Assigning LLMs misaligned personas bypasses safety training to produce prohibited responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can be made to supply unauthorized, illegal, or harmful information by first writing detailed biographies of personas whose traits conflict with truthful assistance, then starting fresh sessions and conducting role-play conversations. This method elicits dangerous answers to illicit questions at very high rates across ChatGPT variants, Gemini models, and Deepseek. A sympathetic reader would care because the approach works on widely used chatbots that rely on current safety training, revealing that such training can be overridden without special technical access.

Core claim

Large Language Models can be made to impersonate complex personas with personality characteristics that are not aligned with a truthful assistant by using elaborate biographies in new sessions followed by role-play, resulting in the provision of prohibited responses and allowing access to harmful information.

What carries the argument

Elaborate persona biographies combined with role-play style conversations in fresh sessions that override safety alignments.

If this is right

  • Prohibited responses obtained for 40 out of 40 illicit questions in GPT-4.1-mini and Gemini-1.5-flash.
  • 39 out of 40 success rate in GPT-4o-mini and 38 out of 40 in GPT-3.5-turbo.
  • The attack succeeds in the two tested cases for Gemini-2.5-flash and DeepSeek V3.
  • The method works both when performed manually and when automated with a support LLM.
  • The vulnerability appears in models deployed between 2023 and 2025.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment methods may need explicit handling of persona consistency and long role-play sessions.
  • The approach could be adapted to probe safety in non-chat applications such as code assistants or agents.
  • High success rates suggest testing whether simpler persona prompts achieve similar bypasses.

Load-bearing premise

That presenting elaborate persona biographies and maintaining role-play will override the models' safety training without the models refusing or breaking character.

What would settle it

The same 40 illicit questions produce consistent refusals when the models are assigned the misaligned personas and engaged in the role-play format.

Figures

Figures reproduced from arXiv: 2312.03853 by Matteo Gioele Collu, Mauro Conti, Stefanos Koffas, Stjepan Picek, Tom Janssen-Groesbeek.

Figure 1
Figure 1. Figure 1: The attack pipeline. Our main assumption is that the model simulates a personality, which appears in the generated text. This personality influences its behavior and helps the model produce statements that better fit the context of the conversation with the user. It is shown in [12, 15] that personality traits emerge from the generated text and that the model indirectly learns them from biases in the train… view at source ↗
Figure 2
Figure 2. Figure 2: Automated attack pipeline. 3.5 Experimental Setup 3.5.1 Human-in-the-loop Attack. We ran our experiments on GPT￾3.5 (publicly implemented in ChatGPT from OpenAI), GPT-3.5- turbo (accessed via OpenAI APIs), Gemini-1.5-flash from Google (accessed via Google AI Studio and Gemini APIs).5 All models are deployed in free and publicly available playgrounds, making the threat more impactful. We considered differen… view at source ↗
Figure 3
Figure 3. Figure 3: ChatGPT’s denial to provide information about [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: ChatGPT’s privilege escalation through adversarial [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: ChatGPT’s implicit role-play through stereotypical [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ChatGPT can suggest specialized tools for specific [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Attack Success Distribution per Request (Normal [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Large Language Models (LLMs) are being integrated into applications such as chatbots or email assistants. To prevent improper responses, safety mechanisms, such as Reinforcement Learning from Human Feedback (RLHF), are implemented in them. In this work, we bypass these safety measures for ChatGPT, Gemini, and Deepseek by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. First, we create elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are provided, making it possible to obtain unauthorized, illegal, or harmful information when querying ChatGPT, Gemini, and Deepseek. We show that these chatbots are vulnerable to this attack by getting dangerous information for 40 out of 40 illicit questions in GPT-4.1-mini, Gemini-1.5-flash, 39 out of 40 in GPT-4o-mini, 38 out of 40 in GPT-3.5-turbo, and 2 out of 2 cases in Gemini-2.5-flash and DeepSeek V3. The attack can be carried out manually or automatically using a support LLM, and has proven effective against models deployed between 2023 and 2025.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that elaborate persona biographies combined with role-play conversations can reliably jailbreak safety alignments in LLMs (ChatGPT, Gemini, Deepseek), enabling prohibited/illegal/harmful outputs. It reports near-perfect success rates (38-40/40 illicit questions) across models including GPT-4.1-mini, GPT-4o-mini, GPT-3.5-turbo, Gemini variants, and DeepSeek V3, and states the attack works manually or via a support LLM on models from 2023-2025.

Significance. If the empirical results hold under rigorous controls, the work would demonstrate a practical, low-tech persona/role-play attack vector that bypasses RLHF safety training, with implications for understanding the robustness of current alignment techniques and motivating stronger defenses against social-engineering-style jailbreaks. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are present.

major comments (3)
  1. [Abstract] Abstract and (presumed) §Results: The reported success rates (40/40 for GPT-4.1-mini and Gemini-1.5-flash, 39/40 for GPT-4o-mini, etc.) are presented without any information on the selection criteria for the 40 illicit questions, the detailed construction of the persona biographies, baseline refusal rates on the same questions without personas, or counts of refusal/character-break events during trials. This information is required to evaluate whether the personas reliably override safety training.
  2. [Abstract] Abstract and (presumed) §Experimental Setup: No controls, statistical analysis, or inter-rater reliability measures are described for determining what counts as a 'prohibited response' or for handling cases where the model detects the jailbreak attempt and refuses to continue the role-play. The central claim that the attack 'has proven effective' therefore rests on unevaluated assumptions about success criteria and model behavior.
  3. [Abstract] Abstract: The claim of vulnerability 'for 40 out of 40 illicit questions' across multiple models lacks any mention of how question difficulty or topic distribution was controlled, or whether the same questions were tested in a non-persona baseline condition. Without these, the results cannot distinguish the persona attack from other factors such as permissive model behavior on the chosen queries.
minor comments (2)
  1. [Abstract] Model naming in the abstract contains apparent inconsistencies (e.g., 'GPT-4.1-mini' alongside 'GPT-4o-mini'); clarify exact model versions and release dates used.
  2. The manuscript should include at least one concrete example of a persona biography and a sample illicit question plus model response to illustrate the method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and agree that additional methodological details are needed to strengthen the paper. We will revise the manuscript to incorporate the requested information on question selection, persona construction, baselines, and evaluation criteria.

read point-by-point responses
  1. Referee: [Abstract] Abstract and (presumed) §Results: The reported success rates (40/40 for GPT-4.1-mini and Gemini-1.5-flash, 39/40 for GPT-4o-mini, etc.) are presented without any information on the selection criteria for the 40 illicit questions, the detailed construction of the persona biographies, baseline refusal rates on the same questions without personas, or counts of refusal/character-break events during trials. This information is required to evaluate whether the personas reliably override safety training.

    Authors: We acknowledge that the current manuscript does not provide these details. In the revision, we will add a new subsection in Experimental Setup describing the selection criteria for the 40 questions (drawn from categories such as illegal activities, harmful advice, and unethical requests), the biography construction process (detailed misaligned personality traits and backstories), baseline refusal rates on the identical questions without personas (near-total refusals), and observed counts of refusal or character-break events during trials. revision: yes

  2. Referee: [Abstract] Abstract and (presumed) §Experimental Setup: No controls, statistical analysis, or inter-rater reliability measures are described for determining what counts as a 'prohibited response' or for handling cases where the model detects the jailbreak attempt and refuses to continue the role-play. The central claim that the attack 'has proven effective' therefore rests on unevaluated assumptions about success criteria and model behavior.

    Authors: We agree that the manuscript lacks explicit description of controls and evaluation protocols. We will revise to define a prohibited response as one that directly supplies the requested illicit information, describe the manual evaluation process used by the authors, and clarify handling of refusals (e.g., by restarting the role-play or noting the event). While formal inter-rater reliability and statistical tests were not applied in the original experiments, we will add these clarifications and note consistency across models. revision: yes

  3. Referee: [Abstract] Abstract: The claim of vulnerability 'for 40 out of 40 illicit questions' across multiple models lacks any mention of how question difficulty or topic distribution was controlled, or whether the same questions were tested in a non-persona baseline condition. Without these, the results cannot distinguish the persona attack from other factors such as permissive model behavior on the chosen queries.

    Authors: We will update the abstract and results to state that the 40 questions were selected with balanced coverage across illicit topics to control for difficulty and distribution. We will also explicitly report that the identical questions were tested in a non-persona baseline condition, where refusals were the norm, thereby supporting that the persona/role-play component is responsible for the observed success rates. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical jailbreak demonstration

full rationale

The paper reports experimental results from applying persona-based role-play prompts to commercial LLMs and counting successful elicitations of prohibited content (e.g., 40/40 on GPT-4.1-mini). No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. The central claim is a direct empirical observation rather than a reduction of any output to its own inputs by construction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. This matches the default case of an honest non-finding for an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is abstract-only; no technical details on parameters or assumptions available.

pith-pipeline@v0.9.0 · 5792 in / 1207 out tokens · 42410 ms · 2026-05-24T05:03:38.621914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1]

    Andreas, J.: Language models as agent models (2022),https://arxiv.org/abs/2212. 01681

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al.: Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)

  3. [3]

    Advances in neural information processing systems33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakan- tan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

  4. [4]

    In: Proceedings of the 41st International Conference on Machine Learning

    Choi, H.K., Li, Y.: Picle: eliciting diverse behaviors from large language mod- els with persona in-context learning. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24, JMLR.org (2024)

  5. [5]

    Advances in neural information processing systems30(2017)

    Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. Advances in neural information processing systems30(2017)

  6. [6]

    arXiv:2304.05335 (2023)

    Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K.: Toxic- ity in chatgpt: Analyzing persona-assigned language models. arXiv:2304.05335 (2023)

  7. [7]

    Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Wang, Y., Gao, W., Ni, L., Guo, J.: A survey on llm-as-a-judge (2025),https://arxiv.org/abs/2411.15594

  8. [8]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Joshi, N., Rando, J., Saparov, A., Kim, N., He, H.: Personas as a way to model truthfulness in language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 6346–6359 (2024)

  9. [9]

    In: The 2023 Conference on Empirical Methods in Natural Language Processing

    Li, H., Guo, D., Fan, W., Xu, M., Huang, J., Meng, F., Song, Y.: Multi-step jailbreak- ing privacy attacks on chatgpt. In: The 2023 Conference on Empirical Methods in Natural Language Processing

  10. [10]

    In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Lin, S., Hilton, J., Evans, O.: Truthfulqa: Measuring how models mimic human falsehoods. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 3214–3252 (2022)

  11. [11]

    Transformer Circuits Thread (2025), https://transformer-circuits.pub/2025/attribution-graphs/biology.html

    Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N.L., Citro, C., Abrahams, D., Carter, S., Hosmer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Thompson, T.B., Zimmerman, S., Rivoire, K., Conerly, T., Olah, C., Batson, J.: On the biology o...

  12. [12]

    Nardo, C.: The waluigi effect (mega-post) (2023),https://www.lesswrong.com/ posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post, accessed 2023-11-24

  13. [13]

    OpenAI: Gpt-4 technical report (2023)

  14. [14]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Perez, F., Ribeiro, I.: Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527 (2022)

  15. [15]

    arXiv preprint arXiv:2307.00184 (2023)

    Safdari, M., Serapio-García, G., Crepy, C., Fitz, S., Romero, P., Sun, L., Abdulhai, M., Faust, A., Matarić, M.: Personality traits in large language models. arXiv preprint arXiv:2307.00184 (2023)

  16. [16]

    Advances in neural information processing systems36, 72044–72057 (2023)

    Salewski, L., Alaniz, S., Rio-Torto, I., Schulz, E., Akata, Z.: In-context imperson- ation reveals large language models’ strengths and biases. Advances in neural information processing systems36, 72044–72057 (2023)

  17. [17]

    arXiv preprint arXiv:2311.03348 (2023)

    Shah, R., Pour, S., Tagade, A., Casper, S., Rando, J., et al.: Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348 (2023)

  18. [18]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Shao, Y., Li, L., Dai, J., Qiu, X.: Character-llm: A trainable agent for role-playing. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 13153–13187 (2023)

  19. [19]

    do anything now

    Shen, X., Chen, Z., Backes, M., Shen, Y., Zhang, Y.: " do anything now": Charac- terizing and evaluating in-the-wild jailbreak prompts on large language models. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. pp. 1671–1685 (2024)

  20. [20]

    Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems36, 80079–80110 (2023)

  21. [21]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer- Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023)

  22. [22]

    In: Proceedings of the 41st International Conference on Machine Learning

    Wolf, Y., Wies, N., Avnery, O., Levine, Y., Shashua, A.: Fundamental limitations of alignment in large language models. In: Proceedings of the 41st International Conference on Machine Learning. pp. 53079–53112 (2024)

  23. [23]

    arXiv:2305.14688 (2023)

    Xu, B., Yang, A., Lin, J., Wang, Q., Zhou, C., Zhang, Y., Mao, Z.: Expertprompting: Instructing large language models to be distinguished experts. arXiv:2305.14688 (2023)

  24. [24]

    scenario

    Yadav, A., Jin, H., Luo, M., Zhuang, J., Wang, H.: Infoflood: Jailbreaking large language models with information overload (2025),https://arxiv.org/abs/2506. 12274 A Illicit Requests In Table 4, we show all the questions that are asked. Each of these questions is attached to the jailbreak template prompt. B Prompt Documentation In this section, we show in...