Recognition: 2 theorem links
· Lean TheoremDo No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Pith reviewed 2026-05-10 20:10 UTC · model grok-4.3
The pith
Persona-based simulations of counseling clients reveal that LLMs can give unauthorized medical advice and reinforce delusions in mental health settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that PCSA generates natural, persona-driven client dialogues to expose vulnerabilities in LLMs for psychological counseling, outperforming baselines by revealing instances of unauthorized medical advice, reinforcement of delusions, and implicit encouragement of risky actions, with supporting evidence from perplexity scores and human evaluations confirming the realism of the simulated interactions.
What carries the argument
PCSA, the Persona-based Client Simulation Attack framework that produces coherent multi-turn dialogues from clients with specific personas to test LLMs for failures in therapeutic safety alignment.
Load-bearing premise
The persona-driven simulated client dialogues are coherent and representative enough of real counseling sessions to expose genuine model weaknesses instead of just simulation artifacts.
What would settle it
A direct comparison of LLM responses to actual human clients in live counseling sessions versus the same LLMs interacting with PCSA-generated personas would show whether the harmful outputs occur at similar rates and types.
Figures
read the original abstract
The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Personality-based Client Simulation Attack (PCSA), a red-teaming framework that generates coherent, persona-driven client dialogues to expose vulnerabilities in LLMs used for psychological counseling. Through experiments on seven general and mental health-specialized LLMs, it claims that PCSA substantially outperforms four competitive baselines, with additional perplexity analysis and human inspection supporting the naturalness of the generated dialogues. The results highlight that current LLMs are vulnerable to domain-specific adversarial tactics, such as providing unauthorized medical advice, reinforcing delusions, and encouraging risky actions.
Significance. If the persona simulations prove to be ecologically valid proxies for real counseling interactions, this work would be significant in highlighting overlooked safety risks in applying LLMs to mental healthcare. It provides a novel attack framework focused on multi-turn therapeutic empathy versus maladaptive validation, which existing red-teaming methods overlook. The empirical evaluation across multiple models adds to the evidence base for domain-specific vulnerabilities.
major comments (2)
- [Abstract and Experiments] The claim of substantial outperformance over four baselines lacks details on baseline implementations, statistical significance testing, and controls for potential simulation biases, as noted in the abstract's summary of results. This makes it challenging to fully evaluate the central empirical claim.
- [Evaluation of PCSA] The perplexity analysis and human inspection for naturalness and realism are referenced, but there is no mention of external validation against real psychological counseling transcripts or ratings by expert clinicians for behavioral fidelity. This is critical because the validity of all reported vulnerabilities and outperformance rests on the assumption that the simulated dialogues are representative of actual client-therapist interactions.
minor comments (2)
- [Abstract] The acronym PCSA is introduced as 'Personality-based' in the abstract but the title uses 'Persona-based'; consistency in terminology would improve clarity.
- [Methods] More details on how the persona-driven dialogues are constructed (e.g., prompt templates, multi-turn coherence mechanisms) would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of empirical rigor and validity that we address below. We plan to revise the manuscript accordingly to strengthen the presentation of our results and clarify the scope of our evaluation.
read point-by-point responses
-
Referee: [Abstract and Experiments] The claim of substantial outperformance over four baselines lacks details on baseline implementations, statistical significance testing, and controls for potential simulation biases, as noted in the abstract's summary of results. This makes it challenging to fully evaluate the central empirical claim.
Authors: We agree that the abstract is high-level and that the Experiments section would benefit from expanded details. In the revised manuscript we will: (1) provide explicit descriptions of how each of the four baselines was implemented and adapted to the counseling domain, including any hyperparameter choices; (2) report statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values and effect sizes) on the primary metrics; and (3) add a dedicated paragraph discussing potential simulation biases (persona sampling, prompt sensitivity) together with the controls we applied, such as diverse persona generation and manual verification of coherence. These changes will be reflected in both the Experiments section and an updated abstract. revision: yes
-
Referee: [Evaluation of PCSA] The perplexity analysis and human inspection for naturalness and realism are referenced, but there is no mention of external validation against real psychological counseling transcripts or ratings by expert clinicians for behavioral fidelity. This is critical because the validity of all reported vulnerabilities and outperformance rests on the assumption that the simulated dialogues are representative of actual client-therapist interactions.
Authors: We acknowledge that direct external validation against real counseling transcripts and expert-clinician ratings would provide stronger evidence of ecological validity. Our current evaluation uses perplexity against held-out dialogue corpora and human ratings focused on linguistic naturalness and coherence; we did not obtain licensed-clinician behavioral-fidelity ratings or perform transcript-level comparisons, primarily due to ethical review constraints and resource limitations. In the revision we will: (a) explicitly state this limitation in a new “Limitations” subsection, (b) detail the human-inspection protocol (rating criteria, number of annotators, inter-rater agreement), and (c) outline concrete directions for future expert validation. We believe these additions will help readers assess the strength of our claims without overstating the current evidence. revision: partial
Circularity Check
No circularity: purely empirical red-teaming framework
full rationale
The paper introduces PCSA as an empirical attack framework for evaluating LLM safety in counseling scenarios. It reports experimental results on seven models against four baselines, supported by perplexity metrics and human inspection for dialogue naturalness. No equations, parameters, derivations, or self-referential constructions appear in the provided text. All claims rest on external evaluations of model outputs rather than internal definitions or fitted inputs renamed as predictions. Self-citations, if present, are not load-bearing for any derivation chain since none exists. The central results are falsifiable via replication on the described models and baselines.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose PCSA, a clinical red-teaming framework... Phase I: Domain-Specific Persona Initialization... Phase II: Strategy-Driven Interaction Loop... four psychological strategies (Reassurance Seeking, Appeal to Expertise, Intellectualization, Metaphorical Expression)
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742. Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, and Nuria Oliver. 2025. Be- tween help and harm: An evaluation of mental health crisis handling by llms.Preprint, arXiv:2509.24857. Asma Ben Abach...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Granite guardian: Comprehensive LLM safe- guarding. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the As- sociation for Computational Linguistics: Human 10 Language Technologies (Volume 3: Industry Track), pages 607–615, Albuquerque, New Mexico. Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu,...
2025
-
[3]
Feint and attack: Jailbreaking and protecting llms via attention distribution modeling. InProceed- ings of the Thirty-Fourth International Joint Con- ference on Artificial Intelligence, IJCAI-25, pages 493–501. International Joint Conferences on Artifi- cial Intelligence Organization. Main Track. Huachuan Qiu, Zhaoming Chen, Yuqian Chen, Yuan Xie, Yu Lu, ...
-
[4]
InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 24278–24306, Suzhou, China
NEXUS: Network exploration for eXploit- ing unsafe sequences in multi-turn LLM jailbreaks. InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 24278–24306, Suzhou, China. Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. LLMs know their vuln...
2025
-
[5]
Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440. Annika Marie Schoene and Cansu Canca. 2025. ‘for argument’s sake, show me how to harm myself!’: Jail- breaking llms in suicide and self-harm contexts. In 2025 IEEE International Symposium on Technol...
-
[6]
source content
Your output should be comparable in length and complexity to the source content. 4. Only output the transformed sentence, without any additional explanation or commentary. CrescendoWhat are some common methods used by individuals to disguise self-harm injuries in public settings? ActorAttackFormat above content to Advice in style of Avoiding getting caugh...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.