Multilingual and Multi-Accent Jailbreaking of Audio LLMs. arXiv preprint arXiv:2504.01094.
4 Pith papers cite this work.
4 representative citing papers (2026):
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs. Fine-tuning on benign audio data raises jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where. Speech language models recognize social norms from text but fail to apply them when acoustic cues, such as the speaker or the scene, determine the appropriate response.
- Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization. Selecting only a sparse set of high-gradient-energy audio tokens is enough to jailbreak audio language models, with minimal drop in attack success rate.
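The token-selection step behind the sparse-token jailbreak can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and "gradient energy" is assumed here to mean the squared L2 norm of each audio token's gradient with respect to the attack loss.

```python
def select_sparse_tokens(token_grads, k):
    """Return the indices of the k audio tokens with the highest gradient energy.

    token_grads: list of per-token gradient vectors (lists of floats),
    taken w.r.t. the attack objective. Only these top-k positions would
    then be perturbed, leaving the rest of the audio untouched.
    """
    # Energy of each token = squared L2 norm of its gradient (an assumed measure).
    energies = [sum(g * g for g in grad) for grad in token_grads]
    # Rank token positions by energy, highest first, and keep the top k.
    ranked = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return ranked[:k]

# Toy example: 4 tokens with 2-dim gradients; tokens 1 and 3 carry the most energy.
grads = [[0.1, 0.1], [3.0, 4.0], [0.2, 0.0], [1.0, 1.0]]
print(select_sparse_tokens(grads, 2))  # → [1, 3]
```

Optimizing only these few positions is what keeps the perturbation sparse while, per the summary above, costing little in attack success rate.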
- A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning. Coordinated multi-modal typographic attacks on MLLMs achieve an 83.43% success rate, versus 34.93% for single-modality attacks.