Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities
Pith reviewed 2026-05-19 23:13 UTC · model grok-4.3
pith:TYNRTIPQ
The pith
Large audio language models lose selective attention to English targets when multilingual distractors appear at low signal-to-noise ratios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models. The MUSA benchmark pairs an English target dialogue with a semantically plausible distractor in one of four languages (English, Spanish, Korean, or Chinese) and evaluates two closed-source and four open-weight models in single-stream, source-separation two-stage, and end-to-end settings under controlled SNRs. Strong single-stream performance does not ensure robust selective auditory attention: cocktail-party accuracy degrades under severe SNRs, errors are dominated by distractor-grounded source confusion, and separation reduces acoustic overlap but leaves source attribution unresolved, often yielding confident wrong-stream answers.
What carries the argument
The MUSA benchmark that constructs controlled multilingual cocktail-party mixtures and measures source-grounded spoken-language understanding across single, two-stage separation, and end-to-end regimes.
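A controlled-SNR mixture of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes an RMS-based SNR definition, equal-length mono signals held as lists of floats, and a hypothetical helper name `mix_at_snr`. The source does not state how MUSA defines or normalizes SNR.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(target, distractor, snr_db):
    """Scale the distractor so the target-to-distractor RMS ratio equals
    snr_db (assumed definition), then add it to the target.

    Returns (mixture, scaled_distractor).
    """
    if len(target) != len(distractor):
        raise ValueError("target and distractor must have the same length")
    t_rms, d_rms = rms(target), rms(distractor)
    if t_rms == 0 or d_rms == 0:
        raise ValueError("signals must not be silent")
    # SNR_dB = 20 * log10(t_rms / (gain * d_rms))  =>  solve for gain
    gain = t_rms / (d_rms * 10 ** (snr_db / 20))
    scaled = [gain * d for d in distractor]
    mixture = [t + d for t, d in zip(target, scaled)]
    return mixture, scaled
```

A negative `snr_db` makes the distractor louder than the target, which corresponds to the severe conditions where accuracy is reported to degrade.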
If this is right
- Cocktail-party accuracy degrades under severe SNRs for all tested models.
- Most errors arise from distractor-grounded source confusion rather than acoustic misunderstanding.
- Source separation reduces acoustic overlap but does not resolve which stream to attend to.
- Confident wrong-stream answers persist after separation in both open and closed models.
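One way to check the claim that errors are mostly distractor-grounded is to label each wrong answer by the stream it matches. The sketch below assumes each item carries a reference answer for the target stream and the answer that the distractor stream would support; the field names `pred`, `target_answer`, and `distractor_answer` are hypothetical, and the paper's actual error taxonomy is not given in the source.

```python
from collections import Counter

def label_prediction(pred, target_answer, distractor_answer):
    """Classify one prediction as 'correct', 'distractor', or 'other'."""
    def norm(s):
        return s.strip().lower()
    if norm(pred) == norm(target_answer):
        return "correct"
    if norm(pred) == norm(distractor_answer):
        return "distractor"
    return "other"

def error_breakdown(records):
    """Return (label counts, share of errors that are distractor-grounded)."""
    labels = Counter(
        label_prediction(r["pred"], r["target_answer"], r["distractor_answer"])
        for r in records
    )
    errors = labels["distractor"] + labels["other"]
    share = labels["distractor"] / errors if errors else 0.0
    return labels, share
```

Exact string matching is a simplification; free-form answers would need a more tolerant matcher or a judge model.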
Where Pith is reading between the lines
- Training regimes that explicitly reward correct source attribution on mixed audio may be required before these models can be trusted in multi-speaker environments.
- The same attribution failure could appear in any audio task where a user intends one speaker but background talk overlaps.
- Benchmarking only on clean single-speaker data will continue to overstate readiness for real acoustic scenes.
Load-bearing premise
The chosen English target dialogues and semantically plausible distractors in four languages, presented at fixed SNRs, accurately isolate selective auditory attention without major confounding from specific content or training-data overlap.
What would settle it
A model whose end-to-end cocktail-party accuracy stays within a few points of its single-speaker accuracy even at the lowest tested SNRs and whose errors are not dominated by distractor-stream attributions.
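The criterion above can be written as a small check. This is a sketch under stated assumptions: "a few points" is read as a 3-point accuracy gap and "dominated" as more than half of all errors; both thresholds, the function name `robustness_report`, and the error labels are illustrative choices, not values from the paper.

```python
from collections import Counter

def robustness_report(single_acc, cocktail_acc_by_snr, error_labels,
                      max_gap=0.03, max_distractor_share=0.5):
    """Compare single-speaker accuracy with end-to-end accuracy at the
    lowest tested SNR, and measure how many errors are distractor-grounded.

    cocktail_acc_by_snr maps SNR in dB to accuracy in [0, 1].
    error_labels holds one label per wrong answer, e.g. 'distractor' or 'other'.
    Thresholds are assumptions for illustration.
    """
    lowest_snr = min(cocktail_acc_by_snr)
    gap = single_acc - cocktail_acc_by_snr[lowest_snr]
    counts = Counter(error_labels)
    total_errors = sum(counts.values())
    share = counts["distractor"] / total_errors if total_errors else 0.0
    return {
        "lowest_snr_db": lowest_snr,
        "gap": gap,
        "distractor_share": share,
        "passes": gap <= max_gap and share < max_distractor_share,
    }
```

A model would count as a counterexample to the paper's claim only if `passes` is true at the harshest SNR, not merely on average across conditions.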
Original abstract
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Each item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean, or Chinese, and evaluates models across (1) single, (2) source separation-based two-stage, and (3) end-to-end cocktail party settings under controlled SNRs. Evaluating two closed-source and four open-weight LALMs, we find that strong single performance does not ensure robust selective auditory attention: cocktail party accuracy degrades under severe SNRs, and errors are dominated by distractor-grounded source confusion. In addition, separation reduces acoustic overlap but leaves source attribution unresolved, often yielding confident wrong-stream answers. Data and code will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MUSA benchmark to evaluate Large Audio Language Models' selective auditory attention in multilingual cocktail-party settings. Each test item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean or Chinese. Models are tested in three regimes—single-stream, source-separation two-stage, and end-to-end—under controlled SNRs. The central empirical claims are that high single-stream accuracy does not imply robust attention under interference, that cocktail-party accuracy drops sharply at low SNRs, that errors are dominated by distractor-grounded source confusion, and that separation mitigates acoustic overlap but leaves source attribution unresolved.
Significance. If the benchmark construction is shown to be free of answerability confounds, the work supplies a useful, reproducible testbed for a practically important capability. The evaluation spans two closed-source and four open-weight LALMs across three processing regimes and multiple languages, and the release of data and code is a clear strength. The finding that separation improves acoustics yet fails to resolve attribution is a concrete, falsifiable observation that future model development can target.
major comments (1)
- [§3 (MUSA benchmark construction) and §4 (error analysis)] The central claim that 'errors are dominated by distractor-grounded source confusion' (abstract and §4) presupposes that every query is uniquely answerable from the target stream. The construction of semantically plausible multilingual distractors creates a non-negligible risk that some questions can be answered from distractor content alone, especially when cross-lingual semantic equivalence preserves key facts. Without an explicit validation that queries are target-exclusive (e.g., human or model checks that distractor-only audio yields the correct answer at chance), the attribution of errors to selective-attention failure rather than answerability overlap remains under-supported.
minor comments (2)
- [Abstract] The abstract states that 'Data and code will be released upon publication'; adding the exact number of items, the precise SNR values, and the statistical test used for accuracy comparisons would make the summary self-contained.
- [§3.2–3.3] In the description of the two-stage and end-to-end pipelines, clarify whether the same separation model is used for all languages or whether language-specific front-ends are employed; this detail affects interpretation of the 'separation reduces acoustic overlap' result.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding potential answerability confounds in the MUSA benchmark below.
- Referee: [§3 (MUSA benchmark construction) and §4 (error analysis)] The central claim that 'errors are dominated by distractor-grounded source confusion' (abstract and §4) presupposes that every query is uniquely answerable from the target stream. The construction of semantically plausible multilingual distractors creates a non-negligible risk that some questions can be answered from distractor content alone, especially when cross-lingual semantic equivalence preserves key facts. Without an explicit validation that queries are target-exclusive (e.g., human or model checks that distractor-only audio yields the correct answer at chance), the attribution of errors to selective-attention failure rather than answerability overlap remains under-supported.
  Authors: We thank the referee for raising this important methodological concern. While the MUSA benchmark was designed such that questions target specific information unique to the English dialogue (with distractors providing plausible but non-overlapping content), we acknowledge that an explicit validation was not reported in the original submission. To address this, we have performed additional experiments feeding only the distractor audio to the models. The resulting accuracies are at or below chance levels across all languages and models tested, indicating that the questions are not answerable from the distractor streams alone. This bolsters our claim that errors arise from source confusion in selective attention. We will update the manuscript with these results in the revised §3 and §4, including the methodology for the validation checks.
  Revision: yes
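A distractor-only answerability check of the kind discussed here could be scored with an exact one-sided binomial test against chance. The sketch assumes a multiple-choice format with `n_options` equally likely choices, which the source does not confirm; the function names and the significance level are illustrative.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def above_chance(correct, total, n_options, alpha=0.05):
    """Test whether distractor-only accuracy exceeds chance (1 / n_options).

    A significant result would suggest that some questions are answerable
    from the distractor stream alone.
    """
    chance = 1.0 / n_options
    p_value = binom_upper_tail(correct, total, chance)
    return {
        "accuracy": correct / total,
        "chance": chance,
        "p_value": p_value,
        "above_chance": p_value < alpha,
    }
```

For example, `above_chance(26, 100, 4)` does not reject chance-level performance, whereas `above_chance(60, 100, 4)` does. Running the check per language and per model, rather than pooled, would show whether any single distractor language leaks answers.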
Circularity Check
No circularity: pure empirical benchmark with no derivations or self-referential reductions
This paper introduces the MUSA benchmark and reports empirical results on LALM performance under multilingual distractors. There are no equations, fitted parameters, mathematical derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central findings on accuracy degradation and distractor-grounded errors follow directly from model outputs on the constructed test items, with no self-definitional loops, imported uniqueness theorems, or ansatz smuggling. The evaluation is self-contained against external model runs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The MUSA items with controlled SNRs and semantically plausible multilingual distractors provide a valid measure of selective auditory attention.
invented entities (1)
- MUSA benchmark: no independent evidence