AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?
Pith reviewed 2026-06-26 13:20 UTC · model grok-4.3
The pith
Large audio language models often refuse benign queries when background sounds make the request harmless.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AOR-Bench shows that large audio language models display widespread over-refusal on pseudo-harmful queries, where audio that appears harmful without context becomes benign when acoustic surroundings are considered, and two simple mitigation approaches produce initial reductions in these incorrect refusals.
What carries the argument
AOR-Bench, a benchmark of 3000 pseudo-harmful audio samples across six scenario categories that isolate cases where acoustic context reverses apparent harmfulness.
If this is right
- Refusal mechanisms in audio models must incorporate acoustic context rather than relying on speech content alone.
- Safety patterns observed across model families can guide the design of more precise refusal rules.
- Lightweight methods such as chain-of-thought and activation steering offer practical starting points for lowering over-refusal.
- Real-world audio applications risk blocking legitimate user requests if context is ignored.
Where Pith is reading between the lines
- Voice assistants operating in noisy environments may need explicit context modeling to avoid rejecting routine commands.
- Safety training done only on text may not transfer well when audio carries additional meaning through sound.
- Benchmarks focused on isolated speech could understate the over-refusal problem that appears once background audio is added.
Load-bearing premise
The 3000 audio samples have been labeled correctly as cases where the full acoustic context renders the query benign rather than harmful.
What would settle it
A fresh labeling pass by multiple listeners that finds most of the 3000 samples remain harmful even with the provided acoustic context, or a direct test showing the evaluated models refuse at low rates on the benchmark.
Figures
read the original abstract
Large Audio Language Models (LALMs) have demonstrated strong performance across a wide range of audio tasks. As they are increasingly deployed in real-world applications, ensuring their safety alignment has become more important. Although refusal mechanisms serve as a key safeguard by preventing LALMs from responding to harmful requests, they can also lead to {\em over-refusal}, where models incorrectly reject benign queries. This issue is especially challenging in the audio domain because speech that appears harmful in isolation may become benign when interpreted together with the surrounding acoustic context, such as background sounds. To study this problem, we introduce \textbf{AOR-Bench} (\textbf{A}udio \textbf{O}ver-\textbf{R}efusal \textbf{Bench}mark), the first benchmark for over-refusal specifically designed for LALMs. AOR-Bench contains 3,000 pseudo-harmful audio samples across six scenario categories. Evaluating 12 representative LALMs from six major model families, we find that over-refusal is widespread (Figure~\ref{fig:overall_performance}) and uncover several important patterns in their safety judgments. As a preliminary effort to mitigate this issue, we further explore two lightweight strategies (e.g., Chain-of-Thought and activation steering) to reduce over-refusal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AOR-Bench, the first benchmark for over-refusal in Large Audio Language Models (LALMs), containing 3,000 pseudo-harmful audio samples across six scenario categories. It evaluates 12 LALMs from six model families on these samples, reports that over-refusal is widespread, identifies patterns in safety judgments, and explores preliminary mitigation via Chain-of-Thought prompting and activation steering.
Significance. If the sample labels are verifiably correct, the work identifies a domain-specific safety failure mode in LALMs where acoustic context can neutralize apparent harm, which is relevant for real-world audio deployments. The multi-family evaluation and mitigation experiments provide an empirical foundation that could guide future alignment research, though the absence of label validation details limits immediate impact.
major comments (1)
- [Abstract] Abstract: The claim that over-refusal is widespread among the 12 evaluated LALMs rests on the assumption that each of the 3,000 samples is correctly labeled as pseudo-harmful (benign once acoustic context is included). No information is supplied on sample generation, the six scenario categories, labeling process, or any validation such as inter-annotator agreement or accuracy checks. Without this, refusal rates cannot be interpreted as over-refusal rather than appropriate refusal.
minor comments (1)
- [Abstract] Abstract: The citation to Figure~\ref{fig:overall_performance} is referenced but the figure itself and its caption are not described in the provided text, making it difficult to assess the reported performance patterns.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on benchmark transparency. We agree that additional details are needed to support interpretation of the results and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that over-refusal is widespread among the 12 evaluated LALMs rests on the assumption that each of the 3,000 samples is correctly labeled as pseudo-harmful (benign once acoustic context is included). No information is supplied on sample generation, the six scenario categories, labeling process, or any validation such as inter-annotator agreement or accuracy checks. Without this, refusal rates cannot be interpreted as over-refusal rather than appropriate refusal.
Authors: We acknowledge that the abstract provides only high-level information on AOR-Bench and that the manuscript would benefit from greater detail on construction and validation to strengthen the over-refusal claim. In the revised version we will (1) expand the abstract to summarize sample generation and labeling, (2) add explicit subsections in Section 3 describing the six scenario categories, the audio synthesis process, the criteria used to label samples as pseudo-harmful, and (3) report any validation steps performed (e.g., manual review or inter-annotator agreement). These additions will allow readers to assess whether observed refusals constitute over-refusal. revision: yes
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
full rationale
The paper introduces an empirical benchmark (AOR-Bench) consisting of 3,000 audio samples and evaluates 12 LALMs on refusal rates. It contains no equations, parameter fitting, derivations, or load-bearing self-citations. The central claim rests on the construction and labeling of the dataset and the observed refusal patterns, which are directly measured rather than derived from prior results by the same authors. No step reduces by construction to its own inputs, and the work is self-contained as an evaluation study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Transactions of the Association for Computational Linguistics , volume=
Know your limits: A survey of abstention in large language models , author=. Transactions of the Association for Computational Linguistics , volume=. 2025 , publisher=
2025
-
[2]
, author=
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. , author=. NeurIPS , year=
-
[3]
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Recent advances in speech language models: A survey , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[4]
Audio is the achilles’ heel: Red teaming audio large multimodal models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
2025
-
[5]
arXiv preprint arXiv:2407.15851 , year=
A survey on trustworthiness in foundation models for medical image analysis , author=. arXiv preprint arXiv:2407.15851 , year=
-
[6]
ICML , year=
Or-bench: An over-refusal benchmark for large language models , author=. ICML , year=
-
[7]
arXiv preprint arXiv:2405.13581 , year=
Safety alignment for vision language models , author=. arXiv preprint arXiv:2405.13581 , year=
-
[8]
arXiv preprint arXiv:2407.09050 , year=
Refusing Safe Prompts for Multi-modal Large Language Models , author=. arXiv preprint arXiv:2407.09050 , year=
-
[9]
arXiv preprint arXiv:2507.04250 , year=
Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning , author=. arXiv preprint arXiv:2507.04250 , year=
-
[10]
arXiv preprint arXiv:2410.03415 , year=
Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation , author=. arXiv preprint arXiv:2410.03415 , year=
-
[11]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
SCANS: Mitigating the exaggerated safety for llms via safety-conscious activation steering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[12]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Navigating the overkill in large language models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[13]
arXiv preprint arXiv:2509.19212 , year=
Steering Multimodal Large Language Models Decoding for Context-Aware Safety , author=. arXiv preprint arXiv:2509.19212 , year=
-
[14]
Scope: Scalable and adaptive evaluation of misguided safety refusal in llms , author=
-
[15]
arXiv preprint arXiv:2409.00598 , year=
Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models , author=. arXiv preprint arXiv:2409.00598 , year=
-
[16]
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
2024
-
[17]
arXiv preprint arXiv:2406.17806 , year=
Mossbench: Is your multimodal language model oversensitive to safe queries? , author=. arXiv preprint arXiv:2406.17806 , year=
-
[18]
arXiv preprint arXiv:2505.23473 , year=
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions , author=. arXiv preprint arXiv:2505.23473 , year=
-
[19]
arXiv preprint arXiv:2501.13772 , year=
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models , author=. arXiv preprint arXiv:2501.13772 , year=
-
[20]
Audio-language models for audio-centric tasks: A survey,
Audio-language models for audio-centric tasks: A survey , author=. arXiv preprint arXiv:2501.15177 , year=
-
[21]
SARSteer: Safeguarding Large Audio-Language Models via Safe-Ablated Refusal Steering
SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering , author=. arXiv preprint arXiv:2510.17633 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Figstep: Jailbreaking large vision-language models via typographic visual prompts , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[23]
arXiv preprint arXiv:2505.19670 , year=
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models , author=. arXiv preprint arXiv:2505.19670 , year=
-
[24]
ACL , year=
When Large Language Models Meet Speech: A Survey on Integration Approaches , author=. ACL , year=
-
[25]
arXiv preprint arXiv:2311.08396 , year=
Zero-shot audio captioning with audio-language model guidance and audio context keywords , author=. arXiv preprint arXiv:2311.08396 , year=
-
[26]
Audiobench: A universal benchmark for audio large language models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
2025
-
[27]
arXiv preprint arXiv:2305.11000 , year=
Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities , author=. arXiv preprint arXiv:2305.11000 , year=
-
[28]
arXiv preprint arXiv:2505.21347 , year=
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models , author=. arXiv preprint arXiv:2505.21347 , year=
-
[29]
Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context
Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context , author=. arXiv preprint arXiv:2601.17642 , year=
work page internal anchor Pith review arXiv
-
[30]
2024 , eprint=
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models , author=. 2024 , eprint=
2024
-
[31]
Say No Too Often: Over-Refusals in Foundation Models , author=
-
[32]
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training , author=. arXiv preprint arXiv:2505.17589 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
2023 IEEE International Conference on Big Data (BigData) , pages=
Multimodal large language models: A survey , author=. 2023 IEEE International Conference on Big Data (BigData) , pages=. 2023 , organization=
2023
-
[34]
Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
A comprehensive survey of hallucination in large language, image, video and audio foundation models , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
2024
-
[35]
2026 , eprint=
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models , author=. 2026 , eprint=
2026
-
[36]
2025 , eprint=
Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations , author=. 2025 , eprint=
2025
-
[37]
2026 , eprint=
Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models , author=. 2026 , eprint=
2026
-
[38]
URL https://blog
Diff-in-means concept editing is worst-case optimal: Explaining a result by Sam Marks and Max Tegmark, 2023 , author=. URL https://blog. eleuther. ai/diff-in-means , year=
2023
-
[39]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[40]
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Persona vectors: Monitoring and controlling character traits in language models , author=. arXiv preprint arXiv:2507.21509 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[42]
As an AI language model, I cannot
“As an AI language model, I cannot”: Investigating LLM Denials of User Requests , author=. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems , pages=
2024
-
[43]
arXiv preprint arXiv:2505.18882 , year=
Personalized safety in llms: A benchmark and a planning-based agent approach , author=. arXiv preprint arXiv:2505.18882 , year=
-
[44]
GLM-5: from Vibe Coding to Agentic Engineering
Glm-5: from vibe coding to agentic engineering , author=. arXiv preprint arXiv:2602.15763 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
2026 , howpublished =
MiniMax-M2.5 , author =. 2026 , howpublished =
2026
-
[46]
2025 , howpublished =
A new era of intelligence with Gemini 3 , author =. 2025 , howpublished =
2025
-
[47]
2025 , howpublished =
Introducing Claude Haiku 4.5 , author =. 2025 , howpublished =
2025
-
[48]
Step-audio 2 technical report , author=. arXiv preprint arXiv:2507.16632 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[49]
Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
Robust Speech Recognition via Large-Scale Weak Supervision
Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya , title =. 2022 , copyright =. doi:10.48550/ARXIV.2212.04356 , url =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.04356 2022
-
[51]
, author=
From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. , author=. Interspeech , number=
-
[52]
arXiv preprint arXiv:2510.16893 , year=
Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations , author=. arXiv preprint arXiv:2510.16893 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.