pith. machine review for the scientific record.

arxiv: 2604.14604 · v1 · submitted 2026-04-16 · 💻 cs.CR · cs.AI · cs.SD

Recognition: unknown

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:27 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SD

keywords auditory prompt injection · adversarial audio · large audio-language models · LALMs · model hijacking · imperceptible attacks · voice agents · audio security

The pith

Large audio-language models can be hijacked by imperceptible adversarial audio that works without knowing the user's context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large audio-language models (LALMs), which process audio and text together for voice interactions, face a new attack surface through malicious audio injection. The authors introduce the AudioHijack framework to craft context-agnostic adversarial audio that steers models toward unwanted behaviors across multiple misbehavior categories. This matters because voice AI systems are widely deployed, and the attacks succeed at high rates on many models while remaining hard to detect by ear. The method optimizes audio perturbations end to end, bypassing non-differentiable barriers such as audio tokenization, and blends them into natural sound patterns. If the results hold, they indicate that the current integration of the audio channel leaves these models open to unauthorized command execution in real settings.

Core claim

The authors demonstrate that their AudioHijack framework generates context-agnostic and imperceptible adversarial audio capable of hijacking 13 state-of-the-art large audio-language models across six misbehavior categories, achieving average success rates of 79% to 96% on unseen user contexts while preserving high acoustic fidelity. The framework relies on sampling-based gradient estimation for end-to-end optimization, attention supervision to direct model focus, and convolutional blending to embed perturbations as natural reverberation. Real-world experiments further show that the same audio can induce commercial voice agents from Mistral AI and Microsoft Azure to carry out unauthorized actions on behalf of users.

What carries the argument

The AudioHijack framework, which uses sampling-based gradient estimation to optimize adversarial audio across non-differentiable tokenization, attention supervision to steer model focus toward the injection, and convolutional blending to modulate perturbations into natural reverberation for imperceptibility.
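The first of these components, sampling-based gradient estimation, is a standard zeroth-order technique; a minimal sketch of an antithetic-sampling estimator is below, assuming only that the target model can be queried for a scalar loss on a candidate waveform. The function names and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def estimated_gradient(loss_fn, audio, n_samples=20, sigma=1e-3):
    """Zeroth-order gradient estimate of loss_fn at `audio` using
    antithetic Gaussian sampling; it works even when the pipeline
    (e.g. audio tokenization) is non-differentiable, because it only
    needs scalar loss values from forward queries."""
    grad = np.zeros_like(audio)
    for _ in range(n_samples):
        noise = np.random.randn(*audio.shape)
        # Probe the loss symmetrically around the current waveform.
        delta = loss_fn(audio + sigma * noise) - loss_fn(audio - sigma * noise)
        grad += delta * noise
    return grad / (2.0 * sigma * n_samples)

def attack_step(loss_fn, audio, lr=1e-3):
    """One descent step on the adversarial carrier waveform."""
    return audio - lr * estimated_gradient(loss_fn, audio)
```

In the paper's setting, `loss_fn` would combine the target-output cross entropy with the attention-supervision term; here it is left abstract.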

If this is right

  • LALMs can be attacked using only audio input without any text access or knowledge of the user's query.
  • The hijacking generalizes reliably to user contexts not seen during attack generation.
  • Six distinct categories of misbehavior can be induced consistently across diverse models.
  • Commercial voice agents are vulnerable to executing actions that users did not authorize.
  • The attacks maintain high acoustic quality, allowing them to pass unnoticed in normal use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice assistants in consumer devices may require separate audio integrity checks beyond model training.
  • Similar injection techniques could extend to other systems that combine audio input with language processing.
  • Defenses focused on limiting attention to anomalous audio segments might reduce the attack surface.
  • Broader deployment of voice AI in sensitive applications would increase the practical impact of such audio-only exploits.

Load-bearing premise

The generated adversarial audio remains effective and imperceptible when played through real-world microphones, speakers, and acoustic environments without being filtered or detected by the model's audio preprocessing pipeline.

What would settle it

Playing the generated adversarial audio through a physical speaker in a typical room, re-recording it with a standard microphone, and feeding the result to one of the tested commercial LALMs to measure whether hijacking success rates stay above 70% without triggering any preprocessing detection.
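Short of a physical replay, a common digital proxy is to convolve the adversarial waveform with a measured room impulse response and add microphone noise before re-querying the model. A sketch under that assumption follows; the `rir` array and the SNR level are placeholders, not measurements from the paper.

```python
import numpy as np

def simulate_playback(audio, rir, snr_db=30.0, seed=None):
    """Crude over-the-air proxy: convolve the waveform with a room
    impulse response (speaker + room + microphone combined) and add
    Gaussian noise at a target SNR. A real test would use measured RIRs."""
    rng = np.random.default_rng(seed)
    wet = np.convolve(audio, rir)[: len(audio)]
    signal_power = np.mean(wet ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wet + rng.normal(0.0, np.sqrt(noise_power), size=wet.shape)
```

Success rates measured on `simulate_playback(adv_audio, rir)` rather than on the raw waveform would indicate whether the attack plausibly survives the acoustic channel.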

Figures

Figures reproduced from arXiv: 2604.14604 by Jiaheng Zhang, Kun Wang, Li Lu, Meng Chen, Tianwei Zhang.

Figure 1. Different audio-text integration schemes.
Figure 2. Threat model of auditory prompt injection.
Figure 3. Attack framework of AudioHijack.
Figure 4. Attention weights from query tokens to the key.
Figure 5. Attack result of non-tool-use misbehaviors.
Figure 6. Attack result across different user contexts.
Figure 8. Attention weight and cross entropy during training.
Figure 9. Log-spectrum of benign audio and adversarial examples with additive and convolutional perturbations.
Figure 10. Impact of auxiliary data size.
Figure 11. Impact of carrier audio length.
Figure 13. Attack result on Voxtral-Mini with in-context
Figure 14. ROCs of logits divergence detection.
Figure 15. PCA on attention deviation detection.
Figure 16. Attention patterns across layers and heads in Voxtral-mini for benign and adversarial audio under different attacks.
Figure 17. Log-spectrum of benign sound and adversarial examples with additive and convolutional perturbations.
Figure 18. Log-spectrum of benign music and adversarial examples with additive and convolutional perturbations.
read the original abstract

Modern large audio-language models (LALMs) power intelligent voice interactions by tightly integrating audio and text. This integration, however, expands the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. While prior work studied audio jailbreaks, the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. In this work, we reveal a previously overlooked threat, auditory prompt injection, under realistic constraints of audio data-only access and strong perceptual stealth. To systematically analyze this threat, we propose AudioHijack, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. AudioHijack employs sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward adversarial audio and generalizes to unseen user contexts. We also design a convolutional blending method that modulates perturbations into natural reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, achieving average success rates of 79%-96% on unseen user contexts with high acoustic fidelity. Real-world studies demonstrate that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces auditory prompt injection as a threat to Large Audio-Language Models (LALMs) and proposes the AudioHijack framework to generate context-agnostic and imperceptible adversarial audio. AudioHijack uses sampling-based gradient estimation to optimize end-to-end across models despite non-differentiable tokenization, attention supervision with multi-context training to steer attention toward the adversarial audio and generalize to unseen user contexts, and convolutional blending to embed perturbations as natural reverberation. Experiments on 13 state-of-the-art LALMs report average success rates of 79%-96% across six misbehavior categories on unseen contexts with high acoustic fidelity, and real-world studies claim that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions.

Significance. If the results hold under rigorous validation, the work is significant for exposing a practical attack surface in the audio channel of LALMs that integrates tightly with text processing. The high reported success rates, context-agnostic generalization, and demonstrations on commercial platforms would highlight urgent needs for defenses in deployed voice agents. The sampling-based optimization and convolutional blending techniques represent useful engineering contributions for attacking non-differentiable multimodal pipelines.

major comments (2)
  1. [Real-world studies] Real-world studies section: the claim that commercial agents can be hijacked rests on the untested assumption that the convolutional-blending perturbations survive microphone capture, speaker playback, room acoustics, and front-end preprocessing (noise suppression, compression). No quantitative ablation is provided on success-rate degradation under these conditions, which is load-bearing for the stated threat model.
  2. [Experiments] Experiments section: the reported 79%-96% success rates lack details on trial counts, data splits for unseen contexts, statistical significance testing, or controls for post-hoc tuning and context selection. This prevents assessment of whether the results generalize or rely on unstated assumptions about model access and evaluation.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'high acoustic fidelity' should be accompanied by specific quantitative metrics (e.g., SNR, PESQ scores) rather than left qualitative.
  2. [Introduction] Notation: the distinction between 'context-agnostic' and prior context-dependent audio attacks could be clarified with a short formal definition or comparison table.
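On the first minor point, the SNR half of such a metric pair is easy to state precisely (PESQ needs a dedicated implementation and is omitted); a minimal sketch:

```python
import numpy as np

def perturbation_snr_db(benign, adversarial):
    """SNR of the benign carrier relative to the injected perturbation,
    in dB; larger values mean a less audible modification."""
    noise = adversarial - benign
    return 10.0 * np.log10(np.sum(benign ** 2) / np.sum(noise ** 2))
```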

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions to strengthen the paper's rigor and transparency.

read point-by-point responses
  1. Referee: [Real-world studies] Real-world studies section: the claim that commercial agents can be hijacked rests on the untested assumption that the convolutional-blending perturbations survive microphone capture, speaker playback, room acoustics, and front-end preprocessing (noise suppression, compression). No quantitative ablation is provided on success-rate degradation under these conditions, which is load-bearing for the stated threat model.

    Authors: We thank the referee for this important point on validating the real-world threat model. Our real-world studies involved playing the generated adversarial audio through commercial speakers and capturing responses via standard microphones in typical indoor environments (with natural room acoustics), and the convolutional blending was explicitly designed to embed perturbations as reverberation to improve robustness to such distortions. However, we agree that the manuscript would benefit from explicit quantitative ablations on degradation factors. In the revised manuscript, we will add a dedicated ablation subsection reporting success rates under controlled variations in microphone distance, room reverberation time, background noise, and common front-end preprocessing (e.g., noise suppression and compression). This will provide direct evidence supporting the practical applicability of the attack. revision: yes

  2. Referee: [Experiments] Experiments section: the reported 79%-96% success rates lack details on trial counts, data splits for unseen contexts, statistical significance testing, or controls for post-hoc tuning and context selection. This prevents assessment of whether the results generalize or rely on unstated assumptions about model access and evaluation.

    Authors: We agree that additional methodological details are essential for reproducibility and to demonstrate that results are not artifacts of evaluation choices. The current manuscript focuses on aggregate success rates across models and categories but does not fully specify trial counts, splits, or statistical controls. In the revised version, we will expand the Experiments section to report: (i) the exact number of trials per model and misbehavior category (typically 100 independent trials), (ii) the procedure for creating unseen contexts (multi-context training with a held-out test split of context templates, ensuring no overlap), (iii) statistical significance testing (e.g., 95% confidence intervals and binomial proportion tests), and (iv) controls confirming that hyperparameters were fixed prior to evaluation on unseen contexts to avoid post-hoc selection bias. We will also clarify the threat model assumptions regarding model access (sampling-based gradient estimation requires only query access for optimization). revision: yes
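The confidence intervals promised in (iii) are routine to compute; for instance, a 95% Wilson score interval for a hijack success rate of 79/100, a hypothetical count matching the "typically 100 independent trials" above:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial proportion, e.g. the
    hijack success rate over independent attack trials."""
    p = successes / trials
    denom = 1.0 + z ** 2 / trials
    centre = (p + z ** 2 / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / trials
                                   + z ** 2 / (4.0 * trials ** 2))
    return centre - half, centre + half
```

`wilson_interval(79, 100)` gives roughly (0.70, 0.86), so a headline 79% success rate carries a non-trivial margin at this trial count.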

Circularity Check

0 steps flagged

No circularity: empirical optimization and measured outcomes

full rationale

The paper presents AudioHijack as an empirical framework using sampling-based gradient estimation, attention supervision, multi-context training, and convolutional blending to generate adversarial audio. All central claims (79-96% success rates, hijacking on 13 LALMs, commercial-agent demonstrations) are reported as direct experimental measurements on held-out contexts and real hardware, not as quantities derived from or equivalent to fitted parameters, self-defined quantities, or prior self-citations. No equations appear that reduce predictions to inputs by construction; the work is self-contained against external benchmarks via explicit ablation-style validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is an empirical attack construction relying on standard optimization and audio processing techniques.

pith-pipeline@v0.9.0 · 5584 in / 1111 out tokens · 26751 ms · 2026-05-10T11:27:22.307618+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 33 canonical work pages · 8 internal anchors

  1. [1]

    Wavchat: A survey of spoken dialogue models

    S. Ji, Y . Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Cheng, X. Yang, Z. Wang, Q. Yang, J. Li, Y . Jiang, J. He, Y . Chu, J. Xu, and Z. Zhao, “WavChat: a Survey of Spoken Dialogue Models,”arXiv preprint, vol. arXiv:2411.13577, 2024

  2. [2]

    ChatGPT V oice Mode,

    OpenAI, “ChatGPT V oice Mode,” 2025. [Online]. Available: https://help.openai.com/en/articles/8400625-voice-mode-faq

  3. [3]

    Gemini Live: Real-time V oice Assistance from Gemini,

    Google AI, “Gemini Live: Real-time V oice Assistance from Gemini,”

  4. [4]

    Available: https://gemini.google/overview/gemini-live

    [Online]. Available: https://gemini.google/overview/gemini-live

  5. [5]

    V oxtral: Frontier Open-Source Speech Understanding Models,

    Mistral AI, “V oxtral: Frontier Open-Source Speech Understanding Models,” 2025. [Online]. Available: https://mistral.ai/news/voxtral

  6. [6]

    Empowering Innovation: The Next Gen- eration of the Phi Family,

    Microsoft Azure, “Empowering Innovation: The Next Gen- eration of the Phi Family,” 2025. [Online]. Avail- able: https://azure.microsoft.com/en-us/blog/empowering-innovation- the-next-generation-of-the-phi-family

  7. [7]

    Ultravox: Next-Gen V oice AI,

    Ultravox AI, “Ultravox: Next-Gen V oice AI,” 2025. [Online]. Available: https://www.ultravox.ai

  8. [8]

    GPT-4o: OpenAI’s new flagship model,

    OpenAI, “GPT-4o: OpenAI’s new flagship model,” 2024. [Online]. Available: https://openai.com/index/gpt-4o

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team, “Gemini 2.5: Pushing the Frontier with Advanced Rea- soning, Multimodality, Long Context, and Next Generation Agentic Capabilities,”arXiv preprint, vol. arXiv:2507.06261, 2025

  10. [10]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Microsoft, “Phi-4-mini technical report: Compact yet powerful mul- timodal language models via mixture-of-loras,”arXiv preprint, vol. arXiv:2503.01743, 2025

  11. [11]
  12. [12]

    JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models,

    Z. Peng, Y . Liu, Z. Sun, M. Li, Z. Luo, J. Zheng, W. Dong, X. He, X. Wang, Y . Xue, S. Xu, and X. Huang, “JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models,”arXiv preprint, vol. arXiv:2505.17568, 2025

  13. [13]

    V oice jailbreak attacks against gpt-4o,

    X. Shen, Y . Wu, M. Backes, and Y . Zhang, “V oice Jailbreak Attacks Against GPT-4o,”arXiv preprint, vol. arXiv:2405.19103, 2024

  14. [14]

    Unveiling the Safety of GPT- 4o: an Empirical Study Using Jailbreak Attacks,

    Z. Ying, A. Liu, X. Liu, and D. Tao, “Unveiling the Safety of GPT- 4o: an Empirical Study Using Jailbreak Attacks,”arXiv preprint, vol. arXiv:2406.06302, 2024

  15. [15]

    Best-of-n jailbreaking

    J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez, and M. Sharma, “Best-of-N Jailbreak- ing,”arXiv preprint, vol. arXiv:2412.03556, 2024

  16. [16]

    Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models

    H. Cheng, E. Xiao, J. Shao, Y . Wang, L. Yang, C. Shen, P. Torr, J. Gu, and R. Xu, “Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models,” arXiv preprint, vol. arXiv:2501.13772, 2025

  17. [17]

    AdvWave: Stealthy Adversarial Jailbreak Attack Against Large Audio-Language Models,

    M. Kang, C. Xu, and B. Li, “AdvWave: Stealthy Adversarial Jailbreak Attack Against Large Audio-Language Models,” inProceedings of ICLR, Singapore, 2025

  18. [18]

    SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models,

    R. Peri, S. M. Jayanthi, S. Ronanki, A. Bhatia, K. Mundnich, S. Dingliwal, N. Das, Z. Hou, G. Huybrechts, S. Vishnubhotla, D. Garcia-Romero, S. Srinivasan, K. J. Han, and K. Kirchhoff, “SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models,”arXiv preprint, vol. arXiv:2405.08317, 2024

  19. [19]

    AudioJailbreak: Jailbreak Attacks Against End-to-End Large Audio- Language Models,

    G. Chen, F. Song, Z. Zhao, X. Jia, Y . Liu, Y . Qiao, and W. Zhang, “AudioJailbreak: Jailbreak Attacks Against End-to-End Large Audio- Language Models,”arXiv preprint, vol. arXiv:2505.14103, 2025

  20. [20]

    Abusing images and sounds for indirect instruction injection in multi-modal llms,

    E. Bagdasaryan, T.-Y . Hsieh, B. Nassi, and V . Shmatikov, “Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs,”arXiv preprint, vol. arXiv:2307.10490, 2023

  21. [21]

    Mind the Gap: Understanding the Modality Gap in Multi-Modal Contrastive Representation Learning,

    W. Liang, Y . Zhang, Y . Kwon, S. Yeung, and J. Y . Zou, “Mind the Gap: Understanding the Modality Gap in Multi-Modal Contrastive Representation Learning,” inProceedings of NeurIPS, New Orleans, LA, USA, 2022

  22. [22]

    Devil’s Whisper: a General Approach for Physical Adver- sarial Attacks Against Commercial Black-Box Speech Recognition Devices,

    Y . Chen, X. Yuan, J. Zhang, Y . Zhao, S. Zhang, K. Chen, and X. Wang, “Devil’s Whisper: a General Approach for Physical Adver- sarial Attacks Against Commercial Black-Box Speech Recognition Devices,” inProceedings of USENIX Security, Virtual Event, 2020, pp. 2667–2684

  23. [23]

    Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems,

    G. Chen, S. Chen, L. Fan, X. Du, Z. Zhao, F. Song, and Y . Liu, “Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems,” inProceedings of IEEE S&P, San Francisco, CA, USA, 2021, pp. 694–711

  24. [24]

    Audio Adversarial Examples: Tar- geted Attacks on Speech-to-Text,

    N. Carlini and D. A. Wagner, “Audio Adversarial Examples: Tar- geted Attacks on Speech-to-Text,” inProceedings of IEEE S&P, San Francisco, CA, USA, 2018, pp. 1–7

  25. [25]

    AdvPulse: Univer- sal, Synchronization-Free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations,

    Z. Li, Y . Wu, J. Liu, Y . Chen, and B. Yuan, “AdvPulse: Univer- sal, Synchronization-Free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations,” inProceedings of ACM CCS, Virtual Event, USA, 2020, pp. 1121–1134

  26. [26]

    AdvReverb: Rethinking the Stealthiness of Audio Adversarial Examples to Human Perception,

    M. Chen, L. Lu, J. Yu, Z. Ba, F. Lin, and K. Ren, “AdvReverb: Rethinking the Stealthiness of Audio Adversarial Examples to Human Perception,”IEEE Trans. Inf. Forensics Secur., vol. 19, pp. 1948– 1962, 2024

  27. [27]

    Funaudiollm: V oice understanding and generation foundation models for natural interaction between humans and llms.arXiv preprint arXiv:2407.04051, 2024

    Tongyi Speech Team, “FunAudioLLM: V oice Understanding and Generation Foundation Models for Natural Interaction Between Hu- mans and LLMs,”arXiv preprint, vol. arXiv:2407.04051, 2024

  28. [28]

    V oice Agents,

    OpenAI, “V oice Agents,” 2024. [Online]. Available: https://platform. openai.com/docs/guides/voice-agents

  29. [29]

    SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities,

    D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y . Zhou, and X. Qiu, “SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities,” inProceedings of ACL EMNLP, Singapore, 2023, pp. 15 757–15 773

  30. [30]

    Glm-4-voice: Towards intelli- gent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612,

    A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y . Dong, and J. Tang, “GLM-4-V oice: Towards Intelligent and Human-Like End-to- End Spoken Chatbot,”arXiv preprint, vol. arXiv:2412.02612, 2024

  31. [31]

    Vita-audio: Fast interleaved cross-modal token generation for efficient large speech-language model

    Z. Long, Y . Shen, C. Fu, H. Gao, L. Li, P. Chen, M. Zhang, H. Shao, J. Li, J. Peng, H. Cao, K. Li, R. Ji, and X. Sun, “VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model,”arXiv preprint, vol. arXiv:2505.03739, 2025

  32. [32]

    Moshi: a speech-text foundation model for real-time dialogue

    A. D ´efossez, L. Mazar ´e, M. Orsini, A. Royer, P. P ´erez, H. J ´egou, E. Grave, and N. Zeghidour, “Moshi: a Speech-Text Foun- dation Model for Real-Time Dialogue,”arXiv preprint, vol. arXiv:2410.00037, 2024

  33. [33]

    Llama- Omni: Seamless Speech Interaction with Large Language Models,

    Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “Llama- Omni: Seamless Speech Interaction with Large Language Models,” inProceedings of ICLR, Singapore, 2025

  34. [34]

    Llama-Omni2: LLM-Based Real-Time Spoken Chatbot with Autoregressive Stream- ing Speech Synthesis,

    Q. Fang, Y . Zhou, S. Guo, S. Zhang, and Y . Feng, “Llama-Omni2: LLM-Based Real-Time Spoken Chatbot with Autoregressive Stream- ing Speech Synthesis,” inProceedings of ACL, Vienna, Austria, 2025, pp. 18 617–18 629

  35. [35]

    Xie and C

    Z. Xie and C. Wu, “Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming,”arXiv preprint, vol. arXiv:2408.16725, 2024

  36. [36]

    Llasm: Large language and speech model.arXiv:2308.15930,

    Y . Shu, S. Dong, G. Chen, W. Huang, R. Zhang, D. Shi, Q. Xiang, and Y . Shi, “LLaSM: Large Language and Speech Model,”arXiv preprint, vol. arXiv:2308.15930, 2023

  37. [37]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models,”arXiv preprint, vol. arXiv:2311.07919, 2023

  38. [38]

    Qwen2-Audio Technical Report

    Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-Audio Technical Report,” arXiv preprint, vol. arXiv:2407.10759, 2024

  39. [39]

    Gemma 3n Model Overview,

    Google AI, “Gemma 3n Model Overview,” 2024. [Online]. Available: https://ai.google.dev/gemma/docs/gemma-3n

  40. [40]

    WavLLM: Towards Robust and Adaptive Speech Large Language Model,

    S. Hu, L. Zhou, S. Liu, S. Chen, L. Meng, H. Hao, J. Pan, X. Liu, J. Li, S. Sivasankaran, L. Liu, and F. Wei, “WavLLM: Towards Robust and Adaptive Speech Large Language Model,” inProceedings of ACL EMNLP, Miami, FL, USA, 2024, pp. 4552–4572

  41. [41]

    Audio Flamingo: a Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities,

    Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, “Audio Flamingo: a Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities,” inProceedings of ACM ICML, Vienna, Austria, 2024

  42. [42]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” inProceedings of ICLR, Vienna, Austria, 2024

  43. [43]

    GAMA: a Large Audio- Language Model with Advanced Audio Understanding and Complex Reasoning Abilities,

    S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “GAMA: a Large Audio- Language Model with Advanced Audio Understanding and Complex Reasoning Abilities,” inProceedings of ACL EMNLP, Miami, FL, USA, 2024, pp. 6288–6313

  44. [44]

    Minmo: A multimodal large language model for seamless voice interaction.CoRR, abs/2501.06282, 2025

    FunAudioLLM Team, “MinMo: a Multimodal Large Language Model for Seamless V oice Interaction,”arXiv preprint, vol. arXiv:2501.06282, 2025

  45. [45]

    Kimi-Audio Technical Report

    Kimi Team, “Kimi-Audio Technical Report,”arXiv preprint, vol. arXiv:2504.18425, 2025

  46. [46]

    Lessons from defending gemini against indirect prompt injections,

    C. Shi, S. Lin, S. Song, J. Hayes, I. Shumailov, I. Yona, J. Pluto, A. Pappu, C. A. Choquette-Choo, M. Nasr, C. Sitawarin, G. Gibson, A. Terzis, and J. Flynn, “Lessons from Defending Gemini Against Indirect Prompt Injections,”arXiv preprint, vol. arXiv:2505.14534, 2025

  47. [47]

    Not What You’ve Signed Up for: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection,

    S. Abdelnabi, K. Greshake, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not What You’ve Signed Up for: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection,” inProceedings of ACM CCS, Copenhagen, Denmark, 2023, pp. 79– 90

  48. [48]

    Audio is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models,

    H. Yang, L. Qu, E. Shareghi, and G. Haffari, “Audio is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models,” inProceedings of ACL NAACL, Albuquerque, New Mexico, USA, 2025, pp. 9292– 9306

  49. [49]

    ”Do Anything Now

    X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “”Do Anything Now”: Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models,” inProceedings of ACM CCS, Salt Lake City, UT, USA, 2024, pp. 1671–1685

  50. [50]

    Masterkey: Automated Jailbreaking of Large Language Model Chatbots,

    G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “Masterkey: Automated Jailbreaking of Large Language Model Chatbots,” inProceedings of ISOC NDSS, San Diego, CA, USA, 2024

  51. [51]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” arXiv preprint, vol. arXiv:2307.15043, 2023

  52. [52]

    DeepInception: Hypno- tize Large Language Model to Be Jailbreaker

    X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han, “DeepInception: Hypnotize Large Language Model to be Jailbreaker,”arXiv preprint, vol. arXiv:2311.03191, 2023

  53. [53]

    How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge Ai Safety by Humanizing LLMs,

    Y . Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi, “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge Ai Safety by Humanizing LLMs,” inProceedings of ACL, Bangkok, Thailand, 2024, pp. 14 322–14 350

  54. [55]

    Imper- ceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition,

    Y . Qin, N. Carlini, G. Cottrell, I. Goodfellow, and C. Raffel, “Imper- ceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition,” inProceedings of PMLR ICML, Long Beach, CA, USA, 2019, pp. 5231–5240

  55. [56]

    L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa, “Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding,” in Proceedings of NDSS, San Diego, CA, USA, 2019

  56. [57]

    M. Chen, L. Lu, Z. Ba, and K. Ren, “PhoneyTalker: An Out-of-the-Box Toolkit for Adversarial Example Attack on Speaker Recognition,” in Proceedings of IEEE INFOCOM, Virtual Event, London, United Kingdom, 2022, pp. 1419–1428

  57. [58]

    J. Li, S. Qu, X. Li, J. Szurley, J. Z. Kolter, and F. Metze, “Adversarial Music: Real World Audio Adversary Against Wake-word Detection System,” in Proceedings of NeurIPS, Vancouver, BC, Canada, 2019, pp. 11908–11918

  58. [59]

    OpenAI, “ChatGPT Record Mode,” 2025. [Online]. Available: https://help.openai.com/en/articles/11487532-chatgpt-record

  59. [60]

    Zoom, “Zoom AI Companion 3.0,” 2025. [Online]. Available: https://www.zoom.com/en/products/ai-assistant

  60. [61]

    OpenClaw AI, “OpenClaw: Personal AI Assistant,” 2026. [Online]. Available: https://openclaw.ai

  61. [62]

    Y. Liu, Y. Jia, R. Geng, J. Jia, and N. Z. Gong, “Formalizing and Benchmarking Prompt Injection Attacks and Defenses,” in Proceedings of USENIX Security, Philadelphia, PA, USA, 2024

  62. [63]

    X. Suo, “Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications,” arXiv preprint, vol. arXiv:2401.07612, 2024

  63. [64]

    F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques for Language Models,” arXiv preprint, vol. arXiv:2211.09527, 2022

  64. [65]

    D. Pasquini, M. Strohmeier, and C. Troncoso, “Neural Exec: Learning (and Learning From) Execution Triggers for Prompt Injection Attacks,” in Proceedings of AISec@CCS, Salt Lake City, UT, USA, 2024, pp. 89–100

  65. [66]

    N. Carlini and D. A. Wagner, “Towards Evaluating the Robustness of Neural Networks,” in Proceedings of IEEE S&P, Los Alamitos, CA, USA, 2017, pp. 39–57

  66. [67]

    E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax,” in Proceedings of ICLR, Toulon, France, 2017

  67. [68]

    Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang, X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou, and J. Zhou, “AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension,” in Proceedings of ACL, Bangkok, Thailand, 2024, pp. 1979–1998

  68. [69]

    Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, “VoiceBench: Benchmarking LLM-Based Voice Assistants,” arXiv preprint, vol. arXiv:2410.17196, 2024

  69. [70]

    K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., “A Summary of the REVERB Challenge: State-of-the-art and Remaining Challenges in Reverberant Speech Processing Research,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, 2016

  70. [71]

    T. Eisenhofer, L. Schönherr, J. Frank, L. Speckemeier, D. Kolossa, and T. Holz, “Dompteur: Taming Audio Adversarial Examples,” in Proceedings of USENIX Security, 2021, pp. 2309–2326

  71. [72]

    M. D. Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A Practical Benchmark for Accents in the Wild,” arXiv preprint, vol. arXiv:2203.15591, 2022

  72. [73]

    Z. Wei, Y. Wang, and Y. Wang, “Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations,” arXiv preprint, vol. arXiv:2310.06387, 2023

  73. [74]

    J. Yi, Y. Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, and F. Wu, “Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models,” in Proceedings of ACM SIGKDD, Toronto, ON, Canada, 2025, pp. 1809–1820

  74. [75]

    M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau, “LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked,” in Proceedings of ICLR, Vienna, Austria, 2024

  75. [76]

    S. Hussain, P. Neekhara, S. Dubnov, J. J. McAuley, and F. Koushanfar, “WaveGuard: Understanding and Mitigating Audio Adversarial Examples,” in Proceedings of USENIX Security, Virtual Event, 2021, pp. 2273–2290

  76. [77]

    K. Wang, X. Xu, L. Lu, Z. Ba, F. Lin, and K. Ren, “FraudWhistler: A Resilient, Robust and Plug-and-Play Adversarial Example Detection Method for Speaker Recognition,” in Proceedings of USENIX Security, Philadelphia, PA, USA, 2024, pp. 7303–7320

  77. [78]

    Z. Yang, B. Li, P.-Y. Chen, and D. Song, “Characterizing Audio Adversarial Examples Using Temporal Dependency,” in Proceedings of ICLR, New Orleans, LA, USA, 2019

  78. [79]

    H. Lin, Y. Lao, T. Geng, T. Yu, and W. Zhao, “UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models,” arXiv preprint, vol. arXiv:2502.13141, 2025

Appendix A. Target Behaviors and Responses

As summarized in Table 7, we instantiate each misbehavior with a set of specific target re...