Recognition: unknown
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3
The pith
Benign fine-tuning on audio samples can raise jailbreak success rates in audio LLMs from single digits to 87 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Benign fine-tuning elevates Jailbreak Success Rate from single digits to as high as 87.12 percent. The dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned, determined by how each model's encoder and projector transform audio into the LLM's input space. Mechanistic analysis on two architectures shows that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned.
What carries the argument
proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content, decomposed into semantic, acoustic, and mixed axes using external and internal encoders
If this is right
- Benign fine-tuning alone can elevate jailbreak success rates to 87.12 percent without any harmful data present.
- The dominant vulnerability axis varies by model architecture and how audio is mapped into the language model space.
- Filtering training data to maximize distance from harmful embeddings reduces jailbreak success rate to near zero.
- Adding a textual system prompt at inference time reduces jailbreak success rate to near zero.
- Fine-tuning suppresses late-layer refusal circuits while preserving encoder representations in an architecture-specific pattern.
Where Pith is reading between the lines
- Safety evaluations for emerging audio LLMs should routinely test embedding proximity of candidate training samples to harmful content.
- Model designers could prioritize encoder and projector choices that reduce unintended proximity between benign and harmful audio.
- The same proximity mechanism might create parallel risks in other non-text modalities such as video when non-semantic features allow embedding closeness to harmful examples.
- Combining the two proposed defenses could provide layered protection that remains effective even if one defense is bypassed.
Load-bearing premise
Embedding-space proximity to harmful content reliably identifies which benign audio samples will degrade safety upon fine-tuning, and results from the three evaluated models generalize to other audio LLMs.
What would settle it
Fine-tuning an audio LLM exclusively on benign samples selected to maximize distance from harmful embeddings in all three axes and finding that jailbreak success rates remain at single-digit levels.
Figures
read the original abstract
Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space -- leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned -- determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses: filtering training data to maximize distance from harmful embeddings, and a textual system prompt at inference, both reducing JSR to near-zero without architectural modification. Our mechanistic analysis on two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first systematic study of benign fine-tuning safety risks in Audio LLMs. Using a proximity-based filtering framework that decomposes embedding-space distance to harmful content into semantic, acoustic, and mixed axes (via external and internal encoders), it shows that fine-tuning on selected benign audio elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12% across three models. The dominant vulnerability axis and audio-vs-text risk are architecture-conditioned, determined by each model's encoder and projector. Mechanistic analysis indicates selective suppression of late-layer refusal circuits while preserving encoder representations. Two defenses are proposed: maximizing embedding distance in training data and adding a textual system prompt at inference, both reducing JSR to near zero.
Significance. If the central results hold, this is a significant contribution to multimodal LLM safety, highlighting qualitatively distinct risks in audio due to richer (semantic + acoustic) embedding spaces. Strengths include the multi-axis proximity decomposition, quantified JSR elevations with architecture-specific patterns, mechanistic observations on refusal suppression, and practical defenses that require no architectural changes. The empirical focus provides concrete, falsifiable measurements that advance understanding beyond text/vision modalities.
major comments (3)
- [§4 and §3.2] §4 (Experiments) and §3.2 (Proximity Framework): The core claim that embedding proximity causally drives JSR elevation (and that the dominant axis is architecture-conditioned) depends on the assumption that proximity-selected benign samples are distinct from generic audio fine-tuning. No control arm is reported that fine-tunes on benign audio maximally distant from (or randomly sampled relative to) the same harmful references. Without this, the observed degradation to 87.12% JSR could be a generic effect of audio fine-tuning rather than proximity-driven, weakening both the mechanistic interpretation and the proposed distance-maximizing defense.
- [§3.2] §3.2 and Results tables: Dataset construction details (exact size of harmful reference set, number of benign samples retained per axis after filtering, and selection thresholds) are described at a high level only. This makes it impossible to verify reproducibility of the 87.12% peak JSR or to assess whether the architecture-conditioned differences are robust to reasonable variations in the reference set.
- [Results section] Results section (including any tables reporting JSR): No statistical significance tests, standard deviations, or multi-seed variance are reported for the JSR increases or cross-architecture comparisons. Given the claim that vulnerability is architecture-conditioned, the absence of these metrics leaves open whether observed differences (e.g., audio vs. text risk) exceed noise.
minor comments (3)
- [Abstract] Abstract: The maximum JSR of 87.12% should be attributed to a specific model and proximity axis for immediate clarity.
- [Figures] Figure captions and legends: Ensure all embedding-space visualizations explicitly label the three axes (semantic, acoustic, mixed) and the internal vs. external encoders used.
- [§4.1] §4.1 (Fine-tuning Setup): While hyperparameters are mentioned, providing the exact values (learning rate, epochs, batch size) per model in a table would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their careful reading and valuable suggestions. We agree that the points raised will improve the paper's rigor and reproducibility. We address each major comment in turn and commit to making the necessary revisions.
read point-by-point responses
-
Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Proximity Framework): The core claim that embedding proximity causally drives JSR elevation (and that the dominant axis is architecture-conditioned) depends on the assumption that proximity-selected benign samples are distinct from generic audio fine-tuning. No control arm is reported that fine-tunes on benign audio maximally distant from (or randomly sampled relative to) the same harmful references. Without this, the observed degradation to 87.12% JSR could be a generic effect of audio fine-tuning rather than proximity-driven, weakening both the mechanistic interpretation and the proposed distance-maximizing defense.
Authors: We agree that a control arm using maximally distant or randomly sampled benign audio would provide stronger causal evidence that proximity, rather than generic fine-tuning, drives the JSR elevation. Our current design isolates effects via axis-specific selection and shows architecture-conditioned patterns across semantic/acoustic/mixed axes, which would be unlikely under a purely generic mechanism. Nevertheless, we will add the requested control experiments in the revised §4, reporting JSR for distant and random benign sets to directly support the proximity interpretation and the distance-maximizing defense. revision: yes
-
Referee: [§3.2] §3.2 and Results tables: Dataset construction details (exact size of harmful reference set, number of benign samples retained per axis after filtering, and selection thresholds) are described at a high level only. This makes it impossible to verify reproducibility of the 87.12% peak JSR or to assess whether the architecture-conditioned differences are robust to reasonable variations in the reference set.
Authors: We acknowledge that §3.2 provides only high-level descriptions. The revised manuscript will expand this section with the precise experimental parameters used, including the exact size of the harmful reference set, the number of benign samples retained per axis after filtering, and the cosine-similarity thresholds applied for each axis. These details will enable independent verification and robustness checks. revision: yes
-
Referee: [Results section] Results section (including any tables reporting JSR): No statistical significance tests, standard deviations, or multi-seed variance are reported for the JSR increases or cross-architecture comparisons. Given the claim that vulnerability is architecture-conditioned, the absence of these metrics leaves open whether observed differences (e.g., audio vs. text risk) exceed noise.
Authors: We recognize the value of statistical reporting for the architecture-conditioned claims. The revised results section and tables will include multi-seed runs (minimum three seeds per condition), means with standard deviations, and appropriate significance tests (e.g., paired t-tests) for JSR differences and cross-architecture comparisons to confirm that the observed patterns exceed experimental noise. revision: yes
Circularity Check
No circularity: purely empirical study with independent experimental results
full rationale
The paper is a systematic empirical evaluation of benign fine-tuning effects on three audio LLMs, measuring changes in Jailbreak Success Rate (JSR) via proximity-based sample selection in embedding spaces and mechanistic analysis of refusal circuits. No derivations, equations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on direct experimental outcomes (JSR elevation to 87.12%, architecture-conditioned axes) and proposed defenses, without renaming known results or smuggling ansatzes. Self-citations, if present, are not load-bearing for the central results, which remain falsifiable via the reported controls and external encoders. This is self-contained empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Embedding spaces produced by audio encoders (both model-internal and external reference encoders) meaningfully capture proximity to harmful content that predicts vulnerability to benign fine-tuning.
Reference graph
Works this paper leans on
-
[1]
Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms.arXiv preprint arXiv:2502.17424,
-
[2]
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
ISSN 1476-4687. doi: 10.1038/s41586-025-09937-5. URL http://dx.doi.org/10.1038/ s41586-025-09937-5. Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, and Saeed Abdullah. Fine-tuning large audio-language models with lora for 10 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs precise temporal localiza...
-
[3]
URL https: //arxiv.org/abs/2506.09707. Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE ...
-
[4]
doi: 10.1109/jstsp.2022.3188113
ISSN 1941-0484. doi: 10.1109/jstsp.2022.3188113. URLhttp://dx.doi.org/10.1109/JSTSP.2022.3188113. Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants,
-
[5]
URL https://arxiv.org/ abs/2410.17196. Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, and Renjing Xu. Jailbreak-audiobench: In-depth evaluation and analysis of jailbreak threats for large audio language models,
-
[6]
URL https://arxiv.org/abs/2501.13772. Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, and Hwayeon Kim. Exploring fine-tuning of large audio language models for spoken language understanding under limited speech data,
-
[7]
Yi Ding, Lijun Li, Bing Cao, and Jing Shao
URLhttps://arxiv.org/abs/2509.15389. Yi Ding, Lijun Li, Bing Cao, and Jing Shao. Rethinking bottlenecks in safety fine-tuning of vision language models,
-
[8]
URLhttps://arxiv.org/abs/2501.18533. Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,
-
[9]
Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,
URLhttps://arxiv.org/abs/2507.08128. Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, and Anil Vullikanti. Benign samples matter! fine-tuning on outlier benign samples severely breaks safety,
- [10]
-
[11]
Luxi He, Mengzhou Xia, and Peter Henderson
URL https: //arxiv.org/abs/2502.00718. Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety,
-
[12]
URLhttps://arxiv.org/abs/2404.01099. Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets,
-
[13]
URL https://arxiv.org/abs/2412.03556. Mintong Kang, Chejian Xu, and Bo Li. Advwave: Stealthy adversarial jailbreak attack against large audio-language models,
-
[14]
URLhttps://arxiv.org/abs/2412.08608. Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, and Adel Bibi. Rethinking safety in llm fine-tuning: An optimization perspective,
-
[15]
URLhttps://arxiv.org/abs/2508.12531. KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong...
-
[16]
URLhttps://arxiv.org/abs/2504.18425. 11 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b,
work page internal anchor Pith review arXiv
-
[17]
LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B, 2023
URLhttps://arxiv.org/abs/2310.20624. Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The crucial role of prompt templates,
-
[18]
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Hen- derson
URL https://arxiv.org/abs/2402.18540. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Hen- derson. Fine-tuning aligned language models compromises safety, even when users do not intend to!,
-
[19]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
URLhttps://arxiv.org/abs/2310.03693. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pp. 28492–28518. PMLR,
work page internal anchor Pith review arXiv
-
[20]
Accessed: 2026-02-22
Version 7.2.7. Accessed: 2026-02-22. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks,
2026
-
[21]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
URLhttps://arxiv.org/abs/1908.10084. Jaechul Roh, Virat Shejwalkar, and Amir Houmansadr. Multilingual and multi-accent jailbreaking of audio llms,
work page internal anchor Pith review arXiv 1908
-
[22]
Multilingual and multi-accent jailbreaking of audio llms.arXiv preprint arXiv:2504.01094,
URLhttps://arxiv.org/abs/2504.01094. Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, and James Glass. Omni-r1: Do you really need audio to fine-tune your audio llm?,
-
[23]
Omni-r1: Do you really need audio to fine-tune your audio llm?arXiv preprint arXiv:2505.09439, 2025
URLhttps://arxiv.org/abs/2505.09439. Ruben Roy. GammaCorpus-Fact-QA-450k: A large-scale fact-based qa dataset. https:// huggingface.co/datasets/rubenroy/GammaCorpus-Fact-QA-450k ,
-
[24]
V oice jailbreak attacks against gpt-4o,
URLhttps://arxiv.org/abs/2405.19103. Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, and Yong Dou. Audio-language models for audio-centric tasks: A survey,
-
[25]
Audio-language models for audio-centric tasks: A survey.arXiv preprint arXiv:2501.15177,
URLhttps://arxiv.org/abs/2501.15177. Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,
-
[26]
Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,
URLhttps://arxiv.org/abs/2506.04779. Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, and Dan Mossing. Persona features control emergent misalignment, 2025a. URL https: //arxiv.org/abs/2506.19823. Yanbo Wang, Jiyang Guan, Jian Liang, and Ran He...
-
[27]
URLhttps://arxiv.org/abs/2503.02318. Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report,
-
[28]
URLhttps://arxiv.org/abs/2503.20215. Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. Audio is the achilles’ heel: Red teaming audio large multimodal models,
work page internal anchor Pith review arXiv
-
[29]
arXiv:2309.07045 (2023), https://arxiv.org/abs/2309.07045
URLhttps://arxiv.org/abs/2309.07045. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models,
-
[30]
URL https://arxiv.org/abs/2307.15043. 13 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs Appendix A Mechanistic Analysis This section provides methodological details for the mechanistic analysis summarized in Section
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
In both models, the audio encoder runs before L0; all hidden states analyzed below are from the LLM backbone
We conduct refusal direction analysis on two models: AF3 (Whisper encoder → 2-layer MLP projector → Qwen2.5-7B, 28 LLM layers L0–L27) and Qwen2.5-Omni (Whisper- Large-V3 pass-through → Qwen2.5-7B, 28 LLM layers L0–L27). In both models, the audio encoder runs before L0; all hidden states analyzed below are from the LLM backbone. Refusal direction extractio...
2024
-
[32]
Adversarial perturbation methods (Kang et al., 2024; Gupta et al.,
bypasses GPT-4o using fictional storytelling in speech, achieving over 70% ASR across languages. Adversarial perturbation methods (Kang et al., 2024; Gupta et al.,
2024
-
[33]
Audio-specific manipulations (e.g., noise injection, pitch shifts) achieve up to 45% ASR even without adversarial optimization (Cheng et al., 2026)
exploit the continuous audio signal for attacks that transfer in black-box settings. Audio-specific manipulations (e.g., noise injection, pitch shifts) achieve up to 45% ASR even without adversarial optimization (Cheng et al., 2026). Hughes et al. (2024) propose Best-of-N jailbreaking through accent and acoustic augmentations, while Roh et al. (2025) demo...
2026
-
[34]
close to
nor the compliance-bias account from vision (Wang et al., 2025b) directly transfers, motivating the modality-specific analysis we develop below. 17 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs D Additional Details of Experimental Setup D.1 Models We evaluate three Audio LLMs that covers a range of encoder architectures:Audio Flamingo 3 (AF3)(G...
2025
-
[35]
is originally a text-only corpus of 450,000 fact-based questions; to create an audio counterpart matched in style to SD-QA, we sample 600 unique questions and synthesize each into the same 11 accent profiles using Edge-TTS (rany2, 2025), yielding 6,600 audio samples.MMSU(Wang et al.,
2025
-
[36]
MELD from Audio-Reasoner-COTA (Xie et al., 2025)
contains 3,000 multiple-choice questions spanning biology, physics, law, economics, and other subjects with single-letter answers. MELD from Audio-Reasoner-COTA (Xie et al., 2025). As explained in the main section, we finetune on MELD using only AF3 and Qwen2.5-Omni, as both models incorporate chain-of-thought reasoning capabilities in their training– AF3...
2025
-
[37]
enabling on-demand thinking, and Qwen2.5-Omni through its Thinker-Talker architecture (Xu et al., 2025). D.3 Harmful Audio Dataset We evaluate safety degradation on two harmful prompt benchmarks, both converted to audio using Google Text-to-Speech (gTTS): AdvBench (Zou et al., 2023), with 520 adver- sarial behavior prompts covering exploit development, vi...
2025
-
[38]
close” to hacking instructions, and a question about gochujang ingredients is “close
benchmark where it evaluates reasoning capability of Audio LLMs. Table 8: BBH utility evaluation. Accuracy (%) for pretrained (PT) and finetuned (FT) models across subtasks. Kimi: MMSU semantic 25%, AF3: SD-QA acoustic 50%, Qwen: SD-QA acoustic 25%. Values in parentheses indicate change relative to the pretrained model (increase, decrease). Model Overall ...
1955
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.