Jailbreak vulnerability in MLLMs is language- and modality-dependent, producing rank reversals in model safety between English and Spanish conditions.
Adversarial attacks and defenses in large language models: Old and new threats
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
UNVERDICTED 2representative citing papers
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.
citing papers explorer
-
Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs
Jailbreak vulnerability in MLLMs is language- and modality-dependent, producing rank reversals in model safety between English and Spanish conditions.
-
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.