Jailbreak attack for large language models: A survey
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.
- IViT: A Novel Interpretable Visual Transformer for Skin Disease Detection
IViT applies quadratic programming to a pre-trained Vision Transformer with a multi-objective loss, achieving 93.80% accuracy on six skin disease datasets (0.21% below baseline) while reducing feature redundancy by 29.5% and producing clinically consistent activations.