Multi-turn jailbreak attacks on medical AI increase unsafe responses from 35% to 80% by turn 4, expose 19x model gaps invisible in single-turn tests, and a lightweight classifier reduces unsafe outputs by 52 points at the cost of 45% false alarms on benign queries.
and Afreen, Samina and Blasko, Barbara and Brazile, Tiffany L
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CR 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks an dClassifier-Based Defenses for Medical AI Safety
Multi-turn jailbreak attacks on medical AI increase unsafe responses from 35% to 80% by turn 4, expose 19x model gaps invisible in single-turn tests, and a lightweight classifier reduces unsafe outputs by 52 points at the cost of 45% false alarms on benign queries.