MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks an dClassifier-Based Defenses for Medical AI Safety

Anushka Sheoran; Yiduo Hao

arxiv: 2606.02630 · v1 · pith:7MY4VB4Inew · submitted 2026-05-30 · 💻 cs.CR · cs.AI

MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks an dClassifier-Based Defenses for Medical AI Safety

Anushka Sheoran , Yiduo Hao This is my paper

Pith reviewed 2026-06-28 18:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords multi-turn jailbreaksmedical AI safetyadversarial attacksclassifier defensespatient safety benchmarkjailbreak evaluationAI robustness

0 comments

The pith

Multi-turn jailbreaks drive unsafe medical AI responses from 35% to nearly 80% by turn 4, exposing gaps single-turn tests miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical chatbots are typically checked with one-off prompts, yet real users often continue after refusals and add pressure. This paper extends PatientSafetyBench into MultiTurnPSB, a four-turn adversarial setup that applies fixed, adaptive, and live attacks to GPT-4.1-mini and other models. Unsafe outputs climb from 35% initially to nearly 80% by the fourth turn under live attacks, while models that look equally safe at baseline separate to a 19x difference later. A lightweight input classifier cuts the unsafe rate by 52 points but produces a 45% false alarm rate on normal queries. The work also records four degradation patterns and a simple two-element attack formula behind most failures.

Core claim

MultiTurnPSB shows that live multi-turn adversarial attacks cause unsafe medical responses to rise from 35% to nearly 80% by Turn 4. Models that are statistically indistinguishable at baseline diverge to a 19x gap in unsafe rates by Turn 4, a difference invisible to single-turn evaluation. Four degradation trajectory signatures are identified, and a two-element attack formula accounts for most catastrophic failures. A lightweight input-side classifier reduces Turn 4 unsafe responses by 52 percentage points, though the 45% false alarm rate on benign queries limits deployment.

What carries the argument

MultiTurnPSB, the four-turn adversarial extension of PatientSafetyBench that tracks response safety across fixed-template, template-adaptive, and live attacks while recording degradation signatures.

If this is right

Single-turn safety benchmarks underestimate the risk profile of medical chatbots under realistic continued user pressure.
Model safety comparisons can change dramatically once conversations extend beyond the first turn.
A two-element attack formula drives the majority of severe safety breakdowns in the evaluated setting.
Input-side classifiers deliver substantial reductions in unsafe outputs but face a deployment barrier from high false alarms on legitimate queries.
Safety training may generalize to the attacker role, as one model refused to generate adversarial messages in over half of late-turn cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Medical AI safety testing should default to multi-turn formats rather than single prompts to avoid underestimating exposure.
Conversation-history-aware defenses may be needed because single-message classifiers already show accuracy degradation by later turns.
The two-element attack formula could be used to stress-test future models or to generate synthetic training data for robustness.
Extending the evaluation to additional models and real usage logs would clarify whether the observed trajectories are general.

Load-bearing premise

The live adversarial attacks generated in the study and the labeling of responses as unsafe accurately capture real-world patient interactions and risks without evaluation artifacts or overly narrow attack templates.

What would settle it

Real patient logs from deployed medical chatbots that show unsafe response rates staying flat or declining across turns, or that show no widening gap between models, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.02630 by Anushka Sheoran, Yiduo Hao.

**Figure 2.** Figure 2: Unsafe rates by category across attack con [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Unsafe rates across all conditions and both defenders. GPT-4.1-mini and Claude Sonnet 4.5 diverge from [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Attacker refusal rates by turn. GPT-4o-mini: [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Classifier drift under adversarial context. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

read the original abstract

Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate GPT-4.1-mini under fixed template, template-adaptive, and live adversarial attacks. Unsafe responses rise from 35% to nearly 80% by Turn 4 under live attack. Under the same adversary, GPT-4.1-mini and Claude Sonnet 4.5 are statistically indistinguishable at baseline but diverge to a 19x gap by Turn 4, a difference invisible to single-turn evaluation. We characterize four degradation trajectory signatures and identify a two-element attack formula responsible for most catastrophic failures. A lightweight input-side classifier reduces Turn 4 unsafe responses by 52 percentage points despite severe accuracy degradation, but the 45% false alarm rate on benign queries is the primary deployment constraint. A methodological finding also emerges: Claude Sonnet refused to generate adversarial messages in over half of late-turn conversations despite explicit red team framing, suggesting safety training may generalize to the attacker role.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Multi-turn testing shows safety degradation and model gaps that single-turn checks miss, but the abstract gives no details on labeling or attack validation.

read the letter

The main point is that single-turn safety tests for medical chatbots miss how attacks build over repeated turns, producing higher unsafe rates and model differences that only appear later.

The paper extends PatientSafetyBench into MultiTurnPSB with four-turn scenarios and compares fixed templates, adaptive templates, and live adversarial attacks on GPT-4.1-mini and Claude Sonnet 4.5. Unsafe responses climb from 35% to nearly 80% by turn 4 under live attacks. The two models start statistically similar but end with a 19x gap in unsafe rate. The authors identify four degradation trajectory signatures and a two-element attack pattern behind most failures. A simple input classifier cuts unsafe outputs by 52 points, though it triggers 45% false alarms on normal queries. They also observe that Claude refuses to generate attacks in over half of late-turn cases.

This work is useful for showing why persistence matters in real patient interactions and for giving concrete patterns others can test against. The trajectory signatures and attack formula are straightforward observations that could guide better evaluation.

The soft spot is the complete lack of information on how responses were labeled unsafe or how the live attacks were constructed and checked. No inter-annotator agreement, no comparison to actual patient logs, and no ablation on the labeler appear in the abstract. If the labels come from an unvalidated process or the attacks are narrow, the reported rise and the 19x gap could be artifacts. The high false-alarm rate on the classifier is already flagged as a deployment issue.

This is for researchers working on safety for patient-facing systems who want to move past one-shot prompts. The core observation about multi-turn effects is worth attention even if the current evidence is thin.

I would send it to peer review so referees can require the missing method details and any validation data.

Referee Report

2 major / 1 minor

Summary. The paper introduces MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench for evaluating jailbreak attacks on patient-facing medical AI. It tests GPT-4.1-mini and Claude Sonnet 4.5 under fixed-template, template-adaptive, and live adversarial attacks, reporting that unsafe response rates rise from 35% to nearly 80% by Turn 4 under live attack. Models that are statistically indistinguishable at baseline diverge to a 19× gap by Turn 4. The work characterizes four degradation trajectory signatures, identifies a two-element attack formula for most failures, and evaluates a lightweight input-side classifier that reduces Turn-4 unsafe responses by 52 points (at the cost of 45% false alarms on benign queries). It also reports that Claude Sonnet refused to generate adversarial messages in over half of late-turn cases.

Significance. If the unsafe/safe labels and live-attack generation are shown to be reliable, the results establish that single-turn evaluation misses substantial safety degradation and model divergence in medical chatbots. The trajectory signatures and two-element attack formula offer concrete, testable patterns for defense design. The classifier experiment quantifies a practical input-side mitigation while surfacing the false-positive barrier to deployment. The refusal observation supplies an empirical note on how safety training can affect the attacker role. These contributions are directly relevant to medical AI safety benchmarks and red-teaming practice.

major comments (2)

[Abstract] Abstract: The headline quantitative claims (unsafe rate 35%→80%, 19× model gap by Turn 4) rest on binary unsafe/safe labels and the fidelity of the live adversarial messages, yet the manuscript supplies no description of how unsafe labels were assigned, inter-annotator agreement, statistical procedure used to establish the 19× gap, or any validation of the attack templates or live messages against real patient interactions or clinical risk criteria.
[Abstract] Abstract and §4 (results on live attacks): The reported 19× divergence and degradation trajectories are measured under live attacks generated by the same class of models under test; the manuscript notes that Claude Sonnet refused >50% of late-turn attack-generation requests, but does not quantify how this refusal rate affects the distribution of successful attacks or whether the remaining attacks remain representative of realistic patient persistence.

minor comments (1)

[Title] Title: contains the typographical error 'an dClassifier' instead of 'and Classifier'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological transparency in the abstract and results. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.

read point-by-point responses

Referee: [Abstract] Abstract: The headline quantitative claims (unsafe rate 35%→80%, 19× model gap by Turn 4) rest on binary unsafe/safe labels and the fidelity of the live adversarial messages, yet the manuscript supplies no description of how unsafe labels were assigned, inter-annotator agreement, statistical procedure used to establish the 19× gap, or any validation of the attack templates or live messages against real patient interactions or clinical risk criteria.

Authors: We agree that the current manuscript lacks explicit details on these elements. In the revised version we will add a dedicated subsection in Methods describing the unsafe/safe labeling criteria and assignment protocol, report inter-annotator agreement where multiple annotators were used, specify the statistical test (two-proportion z-test with bootstrap confidence intervals) employed for the 19× gap, and include an explicit statement in Limitations that the benchmark is synthetic and has not been validated against real patient interactions or formal clinical risk criteria. These additions will be placed before the results to support the headline claims. revision: yes
Referee: [Abstract] Abstract and §4 (results on live attacks): The reported 19× divergence and degradation trajectories are measured under live attacks generated by the same class of models under test; the manuscript notes that Claude Sonnet refused >50% of late-turn attack-generation requests, but does not quantify how this refusal rate affects the distribution of successful attacks or whether the remaining attacks remain representative of realistic patient persistence.

Authors: We concur that the impact of refusals on attack distribution requires quantification. The revision will augment §4 with a new table reporting per-turn attack generation counts, refusal rates by model, and a comparison of unsafe-response rates between successful live attacks and the fixed-template baseline. We will also add a paragraph discussing whether the surviving attacks remain representative of persistent patient behavior, noting that model refusals may truncate the tail of highly persistent adversaries. This analysis will be presented alongside the existing trajectory signatures. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper introduces MultiTurnPSB as a benchmark extension and reports direct experimental measurements of unsafe response rates under fixed, adaptive, and live adversarial attacks across model turns. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. Results such as the 35% to 80% rise and 19x model divergence are presented as observed outcomes from the evaluation protocol rather than outputs forced by construction from inputs. The study is self-contained against its own experimental data with no reduction of claims to prior self-authored results or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical benchmarking study extending an existing safety benchmark; no new free parameters or invented entities are introduced.

axioms (1)

domain assumption The safety criteria and unsafe-response definitions from PatientSafetyBench remain valid when applied to multi-turn conversations.
The extension assumes single-turn safety taxonomy transfers without modification to later turns.

pith-pipeline@v0.9.1-grok · 5737 in / 1244 out tokens · 30286 ms · 2026-06-28T18:39:24.587619+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 6 canonical work pages · 1 internal anchor

[1]

M ed R isk E val: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Corbeil, Jean-Philippe and Kim, Minseon and Griot, Maxime and Agarwal, Sheela and Sordoni, Alessandro and Beaulieu, Francois and Vozila, Paul. M ed R isk E val: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings. Proceedings of the 19th Conference of the European Chapter of the Association f...

work page doi:10.18653/v1/2026.eacl-industry.39 2026
[2]

Röttger, H

R. XST est: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.301

work page doi:10.18653/v1/2024.naacl-long.301 2024
[3]

and Tram \`e r, Florian and Hassani, Hamed and Wong, Eric

Chao, Patrick and Debenedetti, Edoardo and Robey, Alexander and Andriushchenko, Maksym and Croce, Francesco and Sehwag, Vikash and Dobriban, Edgar and Flammarion, Nicolas and Pappas, George J. and Tram \`e r, Florian and Hassani, Hamed and Wong, Eric. JailbreakBench : An Open Robustness Benchmark for Jailbreaking Large Language Models. Advances in Neural ...

2024
[4]

HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Proceedings of the 41st International Conference on Machine Learn...

2024
[5]

MedSafetyBench : Evaluating and Improving the Medical Safety of Large Language Models

Han, Tessa and Kumar, Aounon and Agarwal, Chirag and Lakkaraju, Himabindu. MedSafetyBench : Evaluating and Improving the Medical Safety of Large Language Models. Advances in Neural Information Processing Systems. 2024. doi:10.52202/079017-1054

work page doi:10.52202/079017-1054 2024
[6]

A Novel Evaluation Benchmark for Medical LLM s Illuminating Safety and Effectiveness in Clinical Domains

Wang, Shirui and Tang, Zhihui and Yang, Huaxia and Gong, Qiuhong and Gu, Tiantian and Ma, Hongyang and others. A Novel Evaluation Benchmark for Medical LLM s Illuminating Safety and Effectiveness in Clinical Domains. npj Digital Medicine. 2026. doi:10.1038/s41746-025-02277-8

work page doi:10.1038/s41746-025-02277-8 2026
[7]

and Afreen, Samina and Blasko, Barbara and Brazile, Tiffany L

Draelos, Rachel L. and Afreen, Samina and Blasko, Barbara and Brazile, Tiffany L. and Chase, Natasha and Desai, Dimple Patel and Evert, Jessica and Gardner, Heather L. and Herrmann, Lauren and others. Large Language Models Provide Unsafe Answers to Patient-Posed Medical Questions. npj Digital Medicine. 2026. doi:10.1038/s41746-026-02428-5

work page doi:10.1038/s41746-026-02428-5 2026
[8]

Journal of Medical Internet Research , year =

Yun, Hye Sun and Bickmore, Timothy , title =. Journal of Medical Internet Research , year =
[9]

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Liu, Junyu and Li, Zirui and Niu, Qian and Zhang, Zequn and Xun, Yue and Hou, Wenlong and Wang, Shujun and Iwasawa, Yusuke and Matsuo, Yutaka and Hatakeyama-Sato, Kan , title =. arXiv preprint arXiv:2601.01627 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

M ed R isk E val: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Corbeil, Jean-Philippe and Kim, Minseon and Griot, Maxime and Agarwal, Sheela and Sordoni, Alessandro and Beaulieu, Francois and Vozila, Paul. M ed R isk E val: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings. Proceedings of the 19th Conference of the European Chapter of the Association f...

work page doi:10.18653/v1/2026.eacl-industry.39 2026

[2] [2]

Röttger, H

R. XST est: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.301

work page doi:10.18653/v1/2024.naacl-long.301 2024

[3] [3]

and Tram \`e r, Florian and Hassani, Hamed and Wong, Eric

Chao, Patrick and Debenedetti, Edoardo and Robey, Alexander and Andriushchenko, Maksym and Croce, Francesco and Sehwag, Vikash and Dobriban, Edgar and Flammarion, Nicolas and Pappas, George J. and Tram \`e r, Florian and Hassani, Hamed and Wong, Eric. JailbreakBench : An Open Robustness Benchmark for Jailbreaking Large Language Models. Advances in Neural ...

2024

[4] [4]

HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Proceedings of the 41st International Conference on Machine Learn...

2024

[5] [5]

MedSafetyBench : Evaluating and Improving the Medical Safety of Large Language Models

Han, Tessa and Kumar, Aounon and Agarwal, Chirag and Lakkaraju, Himabindu. MedSafetyBench : Evaluating and Improving the Medical Safety of Large Language Models. Advances in Neural Information Processing Systems. 2024. doi:10.52202/079017-1054

work page doi:10.52202/079017-1054 2024

[6] [6]

A Novel Evaluation Benchmark for Medical LLM s Illuminating Safety and Effectiveness in Clinical Domains

Wang, Shirui and Tang, Zhihui and Yang, Huaxia and Gong, Qiuhong and Gu, Tiantian and Ma, Hongyang and others. A Novel Evaluation Benchmark for Medical LLM s Illuminating Safety and Effectiveness in Clinical Domains. npj Digital Medicine. 2026. doi:10.1038/s41746-025-02277-8

work page doi:10.1038/s41746-025-02277-8 2026

[7] [7]

and Afreen, Samina and Blasko, Barbara and Brazile, Tiffany L

Draelos, Rachel L. and Afreen, Samina and Blasko, Barbara and Brazile, Tiffany L. and Chase, Natasha and Desai, Dimple Patel and Evert, Jessica and Gardner, Heather L. and Herrmann, Lauren and others. Large Language Models Provide Unsafe Answers to Patient-Posed Medical Questions. npj Digital Medicine. 2026. doi:10.1038/s41746-026-02428-5

work page doi:10.1038/s41746-026-02428-5 2026

[8] [8]

Journal of Medical Internet Research , year =

Yun, Hye Sun and Bickmore, Timothy , title =. Journal of Medical Internet Research , year =

[9] [9]

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Liu, Junyu and Li, Zirui and Niu, Qian and Zhang, Zequn and Xun, Yue and Hou, Wenlong and Wang, Shujun and Iwasawa, Yusuke and Matsuo, Yutaka and Hatakeyama-Sato, Kan , title =. arXiv preprint arXiv:2601.01627 , year =

work page internal anchor Pith review Pith/arXiv arXiv