Jailbreak attack for large language models: A survey

Li, N · 2024 · arXiv 1239.202330

2 Pith papers cite this work.

representative citing papers

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

cs.CR · 2026-06-26 · unverdicted · novelty 5.0

Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.

IViT: A Novel Interpretable Visual Transformer for Skin Disease Detection

eess.IV · 2026-06-22 · unverdicted · novelty 4.0

IViT applies quadratic programming to a pre-trained Vision Transformer with a multi-objective loss, achieving 93.80% accuracy on six skin disease datasets (0.21% below baseline) while reducing feature redundancy by 29.5% and producing clinically consistent activations.

citing papers explorer

Showing 2 of 2 citing papers.

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
cs.CR · 2026-06-26 · unverdicted · none · ref 24
IViT: A Novel Interpretable Visual Transformer for Skin Disease Detection
eess.IV · 2026-06-22 · unverdicted · none · ref 32
