arxiv: 2604.16659 · v1 · submitted 2026-04-17 · 💻 cs.CR · cs.SD

Recognition: unknown

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Amir Houmansadr, Jaechul Roh

Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3

classification 💻 cs.CR cs.SD

keywords audio LLMssafety alignmentbenign fine-tuningjailbreak success rateembedding proximityrefusal circuitsmultimodal safety

0 comments

The pith

Benign fine-tuning on audio samples can raise jailbreak success rates in audio LLMs from single digits to 87 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-tuning audio large language models on benign data can break their safety alignments. This occurs because some innocuous audio inputs lie close to harmful content in embedding space, through what is said, how it sounds, or both. The strength of the effect and whether audio fine-tuning poses higher risk than text fine-tuning both depend on the model's encoder and projector architecture. If correct, this means audio LLMs carry a distinct safety vulnerability that simple data filtering or inference prompts can counter without architectural changes.

Core claim

Benign fine-tuning elevates Jailbreak Success Rate from single digits to as high as 87.12 percent. The dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned, determined by how each model's encoder and projector transform audio into the LLM's input space. Mechanistic analysis on two architectures shows that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned.

What carries the argument

proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content, decomposed into semantic, acoustic, and mixed axes using external and internal encoders

If this is right

Benign fine-tuning alone can elevate jailbreak success rates to 87.12 percent without any harmful data present.
The dominant vulnerability axis varies by model architecture and how audio is mapped into the language model space.
Filtering training data to maximize distance from harmful embeddings reduces jailbreak success rate to near zero.
Adding a textual system prompt at inference time reduces jailbreak success rate to near zero.
Fine-tuning suppresses late-layer refusal circuits while preserving encoder representations in an architecture-specific pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety evaluations for emerging audio LLMs should routinely test embedding proximity of candidate training samples to harmful content.
Model designers could prioritize encoder and projector choices that reduce unintended proximity between benign and harmful audio.
The same proximity mechanism might create parallel risks in other non-text modalities such as video when non-semantic features allow embedding closeness to harmful examples.
Combining the two proposed defenses could provide layered protection that remains effective even if one defense is bypassed.

Load-bearing premise

Embedding-space proximity to harmful content reliably identifies which benign audio samples will degrade safety upon fine-tuning, and results from the three evaluated models generalize to other audio LLMs.

What would settle it

Fine-tuning an audio LLM exclusively on benign samples selected to maximize distance from harmful embeddings in all three axes and finding that jailbreak success rates remain at single-digit levels.

Figures

Figures reproduced from arXiv: 2604.16659 by Amir Houmansadr, Jaechul Roh.

**Figure 1.** Figure 1: Overview. Benign and harmful audio are embedded via either the model’s own encoder (model-internal) or a shared reference encoder (semantic, acoustic, or mixed). Benign samples closest to harmful embeddings by cosine distance are selected for fine-tuning, and the resulting model is evaluated on harmful audio benchmarks. Here, proximity-filtered benign fine-tuning elevates JSR from 4.62% to 87.12%, showing … view at source ↗

**Figure 2.** Figure 2: Cross-modal asymmetry: JSR (%) after fine-tuning on semantic proximity-filtered [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture-conditioned refusal signal suppression. Projection onto the refusal direction across LLM layers (L0–L27) for Qwen2.5-Omni (top) and AF3 (bottom), under text (left) and audio (right) fine-tuning on the same semantic proximity-filtered SD-QA samples. In AF3, audio fine-tuning suppresses the late-layer refusal signal while text fine-tuning preserves it. In Qwen2.5-Omni, both modalities suppress t… view at source ↗

**Figure 4.** Figure 4: Illustration of the embedding-based proximity filtering procedure. For each benign sample bi , we compute its cosine distance to every harmful sample hj and take the row minimum dmin(i) = minj d(i, j). Benign samples are then ranked by dmin: the top-25% closest (smallest dmin, shaded red) form the proximate subset, while the bottom-25% farthest (largest dmin, shaded blue) form the distant subset used for s… view at source ↗

**Figure 5.** Figure 5: t-SNE projection of SD-QA (benign) and AdvBench (harmful) audio embeddings [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Example of chain-of-thought self-correction observed in Qwen2.5-Omni after finetuning on Audio-Reasoner-CoTA. The model initially begins to comply with a harmful request but coursecorrects during the reasoning phase upon recognizing the harmful intent. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: System prompt prepended at inference time for the textual system prompt defense (Section 5.5). The same prompt is used across all three models [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

read the original abstract

Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space -- leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned -- determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses: filtering training data to maximize distance from harmful embeddings, and a textual system prompt at inference, both reducing JSR to near-zero without architectural modification. Our mechanistic analysis on two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Benign fine-tuning on proximity-selected audio raises JSR to 87% in an architecture-dependent way, but the causal role of embedding proximity remains unproven without a distant-control arm.

read the letter

The core finding is that fine-tuning audio LLMs on benign samples selected for closeness to harmful content in embedding space can push jailbreak success rates from single digits up to 87.12 percent, with the dominant axis (semantic or acoustic) and the gap versus text fine-tuning both depending on how each model's encoder and projector map audio into the LLM space. This is the first systematic treatment of the issue for audio models rather than text or vision only. The authors test three models, decompose proximity with external and internal encoders, report quantified JSR lifts, show late-layer refusal suppression while the encoder stays intact, and demonstrate two simple defenses that drop JSR near zero. Those elements give the work concrete empirical weight and practical value for anyone building voice interfaces. The main limitation is the absent control. The claim that proximity drives the degradation rests on the assumption that close benign samples are special; without a matched run on maximally distant or random benign audio, the observed effect could be generic to any audio fine-tuning. That weakens both the mechanistic interpretation and the proposed distance-maximizing defense. Dataset construction, exact hyperparameters, and statistical tests are also described at a high level, so some details would need checking in review. The paper is aimed at multimodal safety researchers who care about real-world audio deployments. It is solid enough on the empirical side to merit peer review, even if the control gap means revisions are likely.

Referee Report

3 major / 3 minor

Summary. The paper claims to present the first systematic study of benign fine-tuning safety risks in Audio LLMs. Using a proximity-based filtering framework that decomposes embedding-space distance to harmful content into semantic, acoustic, and mixed axes (via external and internal encoders), it shows that fine-tuning on selected benign audio elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12% across three models. The dominant vulnerability axis and audio-vs-text risk are architecture-conditioned, determined by each model's encoder and projector. Mechanistic analysis indicates selective suppression of late-layer refusal circuits while preserving encoder representations. Two defenses are proposed: maximizing embedding distance in training data and adding a textual system prompt at inference, both reducing JSR to near zero.

Significance. If the central results hold, this is a significant contribution to multimodal LLM safety, highlighting qualitatively distinct risks in audio due to richer (semantic + acoustic) embedding spaces. Strengths include the multi-axis proximity decomposition, quantified JSR elevations with architecture-specific patterns, mechanistic observations on refusal suppression, and practical defenses that require no architectural changes. The empirical focus provides concrete, falsifiable measurements that advance understanding beyond text/vision modalities.

major comments (3)

[§4 and §3.2] §4 (Experiments) and §3.2 (Proximity Framework): The core claim that embedding proximity causally drives JSR elevation (and that the dominant axis is architecture-conditioned) depends on the assumption that proximity-selected benign samples are distinct from generic audio fine-tuning. No control arm is reported that fine-tunes on benign audio maximally distant from (or randomly sampled relative to) the same harmful references. Without this, the observed degradation to 87.12% JSR could be a generic effect of audio fine-tuning rather than proximity-driven, weakening both the mechanistic interpretation and the proposed distance-maximizing defense.
[§3.2] §3.2 and Results tables: Dataset construction details (exact size of harmful reference set, number of benign samples retained per axis after filtering, and selection thresholds) are described at a high level only. This makes it impossible to verify reproducibility of the 87.12% peak JSR or to assess whether the architecture-conditioned differences are robust to reasonable variations in the reference set.
[Results section] Results section (including any tables reporting JSR): No statistical significance tests, standard deviations, or multi-seed variance are reported for the JSR increases or cross-architecture comparisons. Given the claim that vulnerability is architecture-conditioned, the absence of these metrics leaves open whether observed differences (e.g., audio vs. text risk) exceed noise.

minor comments (3)

[Abstract] Abstract: The maximum JSR of 87.12% should be attributed to a specific model and proximity axis for immediate clarity.
[Figures] Figure captions and legends: Ensure all embedding-space visualizations explicitly label the three axes (semantic, acoustic, mixed) and the internal vs. external encoders used.
[§4.1] §4.1 (Fine-tuning Setup): While hyperparameters are mentioned, providing the exact values (learning rate, epochs, batch size) per model in a table would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and valuable suggestions. We agree that the points raised will improve the paper's rigor and reproducibility. We address each major comment in turn and commit to making the necessary revisions.

read point-by-point responses

Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Proximity Framework): The core claim that embedding proximity causally drives JSR elevation (and that the dominant axis is architecture-conditioned) depends on the assumption that proximity-selected benign samples are distinct from generic audio fine-tuning. No control arm is reported that fine-tunes on benign audio maximally distant from (or randomly sampled relative to) the same harmful references. Without this, the observed degradation to 87.12% JSR could be a generic effect of audio fine-tuning rather than proximity-driven, weakening both the mechanistic interpretation and the proposed distance-maximizing defense.

Authors: We agree that a control arm using maximally distant or randomly sampled benign audio would provide stronger causal evidence that proximity, rather than generic fine-tuning, drives the JSR elevation. Our current design isolates effects via axis-specific selection and shows architecture-conditioned patterns across semantic/acoustic/mixed axes, which would be unlikely under a purely generic mechanism. Nevertheless, we will add the requested control experiments in the revised §4, reporting JSR for distant and random benign sets to directly support the proximity interpretation and the distance-maximizing defense. revision: yes
Referee: [§3.2] §3.2 and Results tables: Dataset construction details (exact size of harmful reference set, number of benign samples retained per axis after filtering, and selection thresholds) are described at a high level only. This makes it impossible to verify reproducibility of the 87.12% peak JSR or to assess whether the architecture-conditioned differences are robust to reasonable variations in the reference set.

Authors: We acknowledge that §3.2 provides only high-level descriptions. The revised manuscript will expand this section with the precise experimental parameters used, including the exact size of the harmful reference set, the number of benign samples retained per axis after filtering, and the cosine-similarity thresholds applied for each axis. These details will enable independent verification and robustness checks. revision: yes
Referee: [Results section] Results section (including any tables reporting JSR): No statistical significance tests, standard deviations, or multi-seed variance are reported for the JSR increases or cross-architecture comparisons. Given the claim that vulnerability is architecture-conditioned, the absence of these metrics leaves open whether observed differences (e.g., audio vs. text risk) exceed noise.

Authors: We recognize the value of statistical reporting for the architecture-conditioned claims. The revised results section and tables will include multi-seed runs (minimum three seeds per condition), means with standard deviations, and appropriate significance tests (e.g., paired t-tests) for JSR differences and cross-architecture comparisons to confirm that the observed patterns exceed experimental noise. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study with independent experimental results

full rationale

The paper is a systematic empirical evaluation of benign fine-tuning effects on three audio LLMs, measuring changes in Jailbreak Success Rate (JSR) via proximity-based sample selection in embedding spaces and mechanistic analysis of refusal circuits. No derivations, equations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on direct experimental outcomes (JSR elevation to 87.12%, architecture-conditioned axes) and proposed defenses, without renaming known results or smuggling ansatzes. Self-citations, if present, are not load-bearing for the central results, which remain falsifiable via the reported controls and external encoders. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Claims rest on empirical measurements of JSR changes and embedding proximity rather than theoretical derivations; the key assumption is that representation-space distance serves as a valid proxy for fine-tuning-induced safety loss.

axioms (1)

domain assumption Embedding spaces produced by audio encoders (both model-internal and external reference encoders) meaningfully capture proximity to harmful content that predicts vulnerability to benign fine-tuning.
Invoked in the proximity-based filtering framework and axis decomposition.

pith-pipeline@v0.9.0 · 5585 in / 1403 out tokens · 44493 ms · 2026-05-10T07:58:40.900240+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 29 canonical work pages · 5 internal anchors

[1]

Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms.arXiv preprint arXiv:2502.17424,

work page arXiv
[2]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

ISSN 1476-4687. doi: 10.1038/s41586-025-09937-5. URL http://dx.doi.org/10.1038/ s41586-025-09937-5. Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, and Saeed Abdullah. Fine-tuning large audio-language models with lora for 10 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs precise temporal localiza...

work page doi:10.1038/s41586-025-09937-5
[3]

URL https: //arxiv.org/abs/2506.09707. Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE ...

work page arXiv
[4]

doi: 10.1109/jstsp.2022.3188113

ISSN 1941-0484. doi: 10.1109/jstsp.2022.3188113. URLhttp://dx.doi.org/10.1109/JSTSP.2022.3188113. Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants,

work page doi:10.1109/jstsp.2022.3188113 1941
[5]

Tan, and Haizhou Li

URL https://arxiv.org/ abs/2410.17196. Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, and Renjing Xu. Jailbreak-audiobench: In-depth evaluation and analysis of jailbreak threats for large audio language models,

work page arXiv
[6]

Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models

URL https://arxiv.org/abs/2501.13772. Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, and Hwayeon Kim. Exploring fine-tuning of large audio language models for spoken language understanding under limited speech data,

work page arXiv
[7]

Yi Ding, Lijun Li, Bing Cao, and Jing Shao

URLhttps://arxiv.org/abs/2509.15389. Yi Ding, Lijun Li, Bing Cao, and Jing Shao. Rethinking bottlenecks in safety fine-tuning of vision language models,

work page arXiv
[8]

Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro

URLhttps://arxiv.org/abs/2501.18533. Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

work page arXiv
[9]

Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

URLhttps://arxiv.org/abs/2507.08128. Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, and Anil Vullikanti. Benign samples matter! fine-tuning on outlier benign samples severely breaks safety,

work page arXiv
[10]

i am bad

URL https: //arxiv.org/abs/2505.06843. Isha Gupta, David Khachaturov, and Robert Mullins. "i am bad": Interpreting stealthy, universal and robust audio jailbreaks in audio-language models,

work page arXiv
[11]

Luxi He, Mengzhou Xia, and Peter Henderson

URL https: //arxiv.org/abs/2502.00718. Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety,

work page arXiv
[12]

What is in your safe data? identifying benign data that breaks safety.arXiv preprint arXiv:2404.01099, 2024

URLhttps://arxiv.org/abs/2404.01099. Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets,

work page arXiv
[13]

Best-of-NJailbreaking

URL https://arxiv.org/abs/2412.03556. Mintong Kang, Chejian Xu, and Bo Li. Advwave: Stealthy adversarial jailbreak attack against large audio-language models,

work page arXiv
[14]

Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, and Adel Bibi

URLhttps://arxiv.org/abs/2412.08608. Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, and Adel Bibi. Rethinking safety in llm fine-tuning: An optimization perspective,

work page arXiv
[15]

URLhttps://arxiv.org/abs/2508.12531. KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong...

work page arXiv
[16]

Kimi-Audio Technical Report

URLhttps://arxiv.org/abs/2504.18425. 11 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b,

work page internal anchor Pith review arXiv
[17]

LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B, 2023

URLhttps://arxiv.org/abs/2310.20624. Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The crucial role of prompt templates,

work page arXiv
[18]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Hen- derson

URL https://arxiv.org/abs/2402.18540. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Hen- derson. Fine-tuning aligned language models compromises safety, even when users do not intend to!,

work page arXiv
[19]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

URLhttps://arxiv.org/abs/2310.03693. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pp. 28492–28518. PMLR,

work page internal anchor Pith review arXiv
[20]

Accessed: 2026-02-22

Version 7.2.7. Accessed: 2026-02-22. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks,

2026
[21]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

URLhttps://arxiv.org/abs/1908.10084. Jaechul Roh, Virat Shejwalkar, and Amir Houmansadr. Multilingual and multi-accent jailbreaking of audio llms,

work page internal anchor Pith review arXiv 1908
[22]

Multilingual and multi-accent jailbreaking of audio llms.arXiv preprint arXiv:2504.01094,

URLhttps://arxiv.org/abs/2504.01094. Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, and James Glass. Omni-r1: Do you really need audio to fine-tune your audio llm?,

work page arXiv
[23]

Omni-r1: Do you really need audio to fine-tune your audio llm?arXiv preprint arXiv:2505.09439, 2025

URLhttps://arxiv.org/abs/2505.09439. Ruben Roy. GammaCorpus-Fact-QA-450k: A large-scale fact-based qa dataset. https:// huggingface.co/datasets/rubenroy/GammaCorpus-Fact-QA-450k ,

work page arXiv
[24]

V oice jailbreak attacks against gpt-4o,

URLhttps://arxiv.org/abs/2405.19103. Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, and Yong Dou. Audio-language models for audio-centric tasks: A survey,

work page arXiv
[25]

Audio-language models for audio-centric tasks: A survey.arXiv preprint arXiv:2501.15177,

URLhttps://arxiv.org/abs/2501.15177. Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,

work page arXiv
[26]

Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,

URLhttps://arxiv.org/abs/2506.04779. Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, and Dan Mossing. Persona features control emergent misalignment, 2025a. URL https: //arxiv.org/abs/2506.19823. Yanbo Wang, Jiyang Guan, Jian Liang, and Ran He...

work page arXiv
[27]

Audio-reasoner: Improving reasoning capability in large audio language models.arXiv preprint arXiv:2503.02318, 2025

URLhttps://arxiv.org/abs/2503.02318. Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report,

work page arXiv
[28]

Qwen2.5-Omni Technical Report

URLhttps://arxiv.org/abs/2503.20215. Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. Audio is the achilles’ heel: Red teaming audio large multimodal models,

work page internal anchor Pith review arXiv
[29]

arXiv:2309.07045 (2023), https://arxiv.org/abs/2309.07045

URLhttps://arxiv.org/abs/2309.07045. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models,

work page arXiv
[30]

URL https://arxiv.org/abs/2307.15043. 13 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs Appendix A Mechanistic Analysis This section provides methodological details for the mechanistic analysis summarized in Section

work page internal anchor Pith review Pith/arXiv arXiv
[31]

In both models, the audio encoder runs before L0; all hidden states analyzed below are from the LLM backbone

We conduct refusal direction analysis on two models: AF3 (Whisper encoder → 2-layer MLP projector → Qwen2.5-7B, 28 LLM layers L0–L27) and Qwen2.5-Omni (Whisper- Large-V3 pass-through → Qwen2.5-7B, 28 LLM layers L0–L27). In both models, the audio encoder runs before L0; all hidden states analyzed below are from the LLM backbone. Refusal direction extractio...

2024
[32]

Adversarial perturbation methods (Kang et al., 2024; Gupta et al.,

bypasses GPT-4o using fictional storytelling in speech, achieving over 70% ASR across languages. Adversarial perturbation methods (Kang et al., 2024; Gupta et al.,

2024
[33]

Audio-specific manipulations (e.g., noise injection, pitch shifts) achieve up to 45% ASR even without adversarial optimization (Cheng et al., 2026)

exploit the continuous audio signal for attacks that transfer in black-box settings. Audio-specific manipulations (e.g., noise injection, pitch shifts) achieve up to 45% ASR even without adversarial optimization (Cheng et al., 2026). Hughes et al. (2024) propose Best-of-N jailbreaking through accent and acoustic augmentations, while Roh et al. (2025) demo...

2026
[34]

close to

nor the compliance-bias account from vision (Wang et al., 2025b) directly transfers, motivating the modality-specific analysis we develop below. 17 Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs D Additional Details of Experimental Setup D.1 Models We evaluate three Audio LLMs that covers a range of encoder architectures:Audio Flamingo 3 (AF3)(G...

2025
[35]

is originally a text-only corpus of 450,000 fact-based questions; to create an audio counterpart matched in style to SD-QA, we sample 600 unique questions and synthesize each into the same 11 accent profiles using Edge-TTS (rany2, 2025), yielding 6,600 audio samples.MMSU(Wang et al.,

2025
[36]

MELD from Audio-Reasoner-COTA (Xie et al., 2025)

contains 3,000 multiple-choice questions spanning biology, physics, law, economics, and other subjects with single-letter answers. MELD from Audio-Reasoner-COTA (Xie et al., 2025). As explained in the main section, we finetune on MELD using only AF3 and Qwen2.5-Omni, as both models incorporate chain-of-thought reasoning capabilities in their training– AF3...

2025
[37]

enabling on-demand thinking, and Qwen2.5-Omni through its Thinker-Talker architecture (Xu et al., 2025). D.3 Harmful Audio Dataset We evaluate safety degradation on two harmful prompt benchmarks, both converted to audio using Google Text-to-Speech (gTTS): AdvBench (Zou et al., 2023), with 520 adver- sarial behavior prompts covering exploit development, vi...

2025
[38]

close” to hacking instructions, and a question about gochujang ingredients is “close

benchmark where it evaluates reasoning capability of Audio LLMs. Table 8: BBH utility evaluation. Accuracy (%) for pretrained (PT) and finetuned (FT) models across subtasks. Kimi: MMSU semantic 25%, AF3: SD-QA acoustic 50%, Qwen: SD-QA acoustic 25%. Values in parentheses indicate change relative to the pretrained model (increase, decrease). Model Overall ...

1955