A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
arXiv preprint arXiv:2601.05445
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CR 2years
2026 2roles
background 1polarities
background 1representative citing papers
Proposes the TRIAD framework that treats multi-turn multimodal attacks as continuous trajectories and uses structural anomaly detection, regularized Mahalanobis distance, topological acceleration, and a time-varying Cox model with Bayesian HMM feedback to predict and bound expected time-to-failure.
citing papers explorer
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks
Proposes the TRIAD framework that treats multi-turn multimodal attacks as continuous trajectories and uses structural anomaly detection, regularized Mahalanobis distance, topological acceleration, and a time-varying Cox model with Bayesian HMM feedback to predict and bound expected time-to-failure.