arxiv: 2605.17225 · v1 · pith:TYNRTIPQnew · submitted 2026-05-17 · 📡 eess.AS

Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities

Pith reviewed 2026-05-19 23:13 UTC · model grok-4.3

classification 📡 eess.AS
keywords Large Audio Language Models · Selective Auditory Attention · Cocktail Party Problem · Multilingual Distractors · Source Separation · Spoken Language Understanding · Benchmark Evaluation · Signal-to-Noise Ratio

The pith

Large audio language models lose selective attention to English targets when multilingual distractors appear at low signal-to-noise ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the MUSA benchmark to test whether large audio language models can focus on a target English dialogue while ignoring semantically plausible spoken distractors in English, Spanish, Korean, or Chinese. It compares model behavior across clean single-speaker listening, two-stage pipelines that first separate sources, and direct end-to-end cocktail-party listening at controlled noise levels. Results show that high accuracy on isolated tasks does not carry over: performance falls sharply at severe SNRs, most mistakes come from confusing which speaker is the target, and separation reduces sound overlap yet still produces confident answers drawn from the wrong stream. A sympathetic reader cares because real-world audio interfaces must handle overlapping talk in multiple languages without hallucinating from the interference.

Core claim

Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models. The MUSA benchmark pairs an English target dialogue with a semantically plausible distractor in one of four languages and evaluates two closed-source and four open-weight models in single, source-separation two-stage, and end-to-end settings under controlled SNRs. Strong single-stream performance does not ensure robust selective auditory attention: cocktail-party accuracy degrades under severe SNRs, errors are dominated by distractor-grounded source confusion, and separation reduces acoustic overlap but leaves source attribution unresolved, often yielding confident wrong-stream answers.

What carries the argument

The MUSA benchmark, which constructs controlled multilingual cocktail-party mixtures and measures source-grounded spoken-language understanding across single, two-stage separation, and end-to-end regimes.
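One way to picture a controlled-SNR mixture is the standard power-ratio scaling sketched below. This is a minimal illustration under stated assumptions, not the paper's pipeline: the helper names `power` and `mix_at_snr`, the use of plain Python lists of samples, and the choice to rescale the distractor rather than the target are all hypothetical.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(target, distractor, snr_db):
    """Scale the distractor so the target-to-distractor power ratio is snr_db, then add.

    Hypothetical helper: assumes equal-length, non-silent signals.
    """
    if len(target) != len(distractor):
        raise ValueError("signals must have equal length")
    p_t, p_d = power(target), power(distractor)
    if p_t == 0 or p_d == 0:
        raise ValueError("signals must be non-silent")
    # The desired distractor power is p_t / 10^(snr_db / 10).
    gain = math.sqrt(p_t / (p_d * 10 ** (snr_db / 10)))
    scaled = [gain * d for d in distractor]
    mixture = [t + d for t, d in zip(target, scaled)]
    return mixture, scaled

# Toy signals: two sinusoids standing in for target and distractor speech.
target = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
distractor = [0.3 * math.sin(2 * math.pi * 13 * n / 200) for n in range(200)]
mixture, scaled = mix_at_snr(target, distractor, snr_db=-5.0)
achieved_db = 10 * math.log10(power(target) / power(scaled))
```

Here `achieved_db` recovers the requested ratio, so a negative value means the distractor is louder than the target. Real speech would normally also need level normalization and handling of silent segments, which the sketch leaves out.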

If this is right

  • Cocktail-party accuracy degrades under severe SNRs for all tested models.
  • Most errors arise from distractor-grounded source confusion rather than acoustic misunderstanding.
  • Source separation reduces acoustic overlap but does not resolve which stream to attend to.
  • Confident wrong-stream answers persist after separation in both open and closed models.
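The distinction between distractor-grounded confusion and other errors can be made concrete with a small tally. The sketch below assumes a multiple-choice format in which each item records the option supported by the distractor stream; the field names `gold`, `distractor_option`, and `predicted` are hypothetical, and the paper's actual error taxonomy may differ.

```python
from collections import Counter

def categorize(item):
    """Label one multiple-choice response (hypothetical schema)."""
    if item["predicted"] == item["gold"]:
        return "correct"
    if item["predicted"] == item["distractor_option"]:
        return "distractor_grounded"
    return "other_error"

def error_breakdown(items):
    """Return category counts and the share of errors grounded in the distractor."""
    counts = Counter(categorize(it) for it in items)
    errors = counts["distractor_grounded"] + counts["other_error"]
    share = counts["distractor_grounded"] / errors if errors else 0.0
    return counts, share

# Made-up responses for illustration only; these are not results from the paper.
responses = [
    {"gold": "A", "distractor_option": "C", "predicted": "A"},
    {"gold": "B", "distractor_option": "D", "predicted": "D"},
    {"gold": "C", "distractor_option": "A", "predicted": "A"},
    {"gold": "D", "distractor_option": "B", "predicted": "C"},
]
counts, distractor_share = error_breakdown(responses)
```

A share well above one half on real data would correspond to the "dominated by source confusion" reading in the bullets above.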

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that explicitly reward correct source attribution on mixed audio may be required before these models can be trusted in multi-speaker environments.
  • The same attribution failure could appear in any audio task where a user intends one speaker but background talk overlaps.
  • Benchmarking only on clean single-speaker data will continue to overstate readiness for real acoustic scenes.

Load-bearing premise

The chosen English target dialogues and semantically plausible distractors in four languages, presented at fixed SNRs, accurately isolate selective auditory attention without major confounding from specific content or training-data overlap.

What would settle it

A model whose end-to-end cocktail-party accuracy stays within a few points of its single-speaker accuracy even at the lowest tested SNRs and whose errors are not dominated by distractor-stream attributions.
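This criterion can be written as an explicit check. The sketch below is only one reading of it: the tolerance for "a few points" (`max_gap_points=3.0`) and the interpretation of "dominated" as more than half of all errors (`dominance_threshold=0.5`) are assumptions, since the text specifies neither.

```python
def settles_claim(single_acc, cocktail_acc_by_snr, distractor_error_share,
                  max_gap_points=3.0, dominance_threshold=0.5):
    """Return True if a model would count as a counterexample under the assumed thresholds.

    Accuracies are percentages; cocktail_acc_by_snr maps SNR in dB to accuracy.
    """
    lowest_snr = min(cocktail_acc_by_snr)
    gap = single_acc - cocktail_acc_by_snr[lowest_snr]
    return gap <= max_gap_points and distractor_error_share <= dominance_threshold

# Made-up numbers for illustration only.
robust = settles_claim(90.0, {-10: 88.5, 0: 89.0, 10: 90.0}, 0.30)
fragile = settles_claim(90.0, {-10: 55.0, 0: 80.0, 10: 89.0}, 0.70)
```

With these placeholder numbers, `robust` is `True` and `fragile` is `False`.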

Figures

Figures reproduced from arXiv: 2605.17225 by Heejoon Koo.

Figure 1: Our MUSA evaluation framework.
Figure 2: Average accuracy across target-to-distractor

Original abstract

Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Each item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean, or Chinese, and evaluates models across (1) single, (2) source separation-based two-stage, and (3) end-to-end cocktail party settings under controlled SNRs. Evaluating two closed-source and four open-weight LALMs, we find that strong single performance does not ensure robust selective auditory attention: cocktail party accuracy degrades under severe SNRs, and errors are dominated by distractor-grounded source confusion. In addition, separation reduces acoustic overlap but leaves source attribution unresolved, often yielding confident wrong-stream answers. Data and code will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MUSA benchmark to evaluate Large Audio Language Models' selective auditory attention in multilingual cocktail-party settings. Each test item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean or Chinese. Models are tested in three regimes—single-stream, source-separation two-stage, and end-to-end—under controlled SNRs. The central empirical claims are that high single-stream accuracy does not imply robust attention under interference, that cocktail-party accuracy drops sharply at low SNRs, that errors are dominated by distractor-grounded source confusion, and that separation mitigates acoustic overlap but leaves source attribution unresolved.

Significance. If the benchmark construction is shown to be free of answerability confounds, the work supplies a useful, reproducible testbed for a practically important capability. The evaluation spans two closed-source and four open-weight LALMs across three processing regimes and multiple languages, and the planned release of data and code is a clear strength. The finding that separation improves acoustics yet fails to resolve attribution is a concrete, falsifiable observation that future model development can target.

major comments (1)
  1. [§3 (MUSA benchmark construction) and §4 (error analysis)] The central claim that 'errors are dominated by distractor-grounded source confusion' (abstract and §4) presupposes that every query is uniquely answerable from the target stream. The construction of semantically plausible multilingual distractors creates a non-negligible risk that some questions can be answered from distractor content alone, especially when cross-lingual semantic equivalence preserves key facts. Without an explicit validation that queries are target-exclusive (e.g., human or model checks that distractor-only audio yields the correct answer at chance), the attribution of errors to selective-attention failure rather than answerability overlap remains under-supported.
minor comments (2)
  1. [Abstract] The abstract states that 'Data and code will be released upon publication'; adding the exact number of items, the precise SNR values, and the statistical test used for accuracy comparisons would make the summary self-contained.
  2. [§3.2–3.3] In the description of the two-stage and end-to-end pipelines, clarify whether the same separation model is used for all languages or whether language-specific front-ends are employed; this detail affects interpretation of the 'separation reduces acoustic overlap' result.
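On the statistical test requested in minor comment 1: because the regimes are evaluated on the same items, a paired test is one natural candidate. The sketch below implements an exact McNemar test as an illustration only. Nothing in the source says the paper uses it, and the per-item outcomes are made up.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness (lists of bools)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired lists must have equal length")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # A right, B wrong
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # B right, A wrong
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under H0, each discordant pair favours A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Made-up outcomes: regime A (e.g. single) versus regime B (e.g. end-to-end).
single = [True] * 10 + [True] * 8 + [False] * 1 + [False] * 1
cocktail = [True] * 10 + [False] * 8 + [True] * 1 + [False] * 1
b, c, p_value = mcnemar_exact(single, cocktail)
```

Only the discordant pairs (items that one regime gets right and the other gets wrong) enter the test, which is why it suits comparisons of accuracy on a shared item set.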

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding potential answerability confounds in the MUSA benchmark below.

  1. Referee: [§3 (MUSA benchmark construction) and §4 (error analysis)] The central claim that 'errors are dominated by distractor-grounded source confusion' (abstract and §4) presupposes that every query is uniquely answerable from the target stream. The construction of semantically plausible multilingual distractors creates a non-negligible risk that some questions can be answered from distractor content alone, especially when cross-lingual semantic equivalence preserves key facts. Without an explicit validation that queries are target-exclusive (e.g., human or model checks that distractor-only audio yields the correct answer at chance), the attribution of errors to selective-attention failure rather than answerability overlap remains under-supported.

    Authors: We thank the referee for raising this important methodological concern. While the MUSA benchmark was designed such that questions target specific information unique to the English dialogue (with distractors providing plausible but non-overlapping content), we acknowledge that an explicit validation was not reported in the original submission. To address this, we have performed additional experiments feeding only the distractor audio to the models. The resulting accuracies are at or below chance levels across all languages and models tested, indicating that the questions are not answerable from the distractor streams alone. This bolsters our claim that errors arise from source confusion in selective attention. We will update the manuscript with these results in the revised §3 and §4, including the methodology for the validation checks.

    Revision: yes
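The distractor-only validation described in this response amounts to asking whether accuracy on distractor audio alone exceeds chance. A one-sided exact binomial test is one way to check that. The sketch below assumes four-option multiple choice (chance 0.25), which the source does not state, and uses made-up counts.

```python
from math import comb

def binom_upper_tail(successes, trials, chance):
    """P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def above_chance(successes, trials, chance=0.25, alpha=0.05):
    """True if distractor-only accuracy is significantly above chance (assumed 4 options)."""
    return binom_upper_tail(successes, trials, chance) < alpha

# Made-up counts for illustration only.
near_chance = above_chance(successes=27, trials=100)  # 27% correct
leaky = above_chance(successes=45, trials=100)        # 45% correct
```

An item set behaving like `leaky` would suggest that some questions can be answered from the distractor stream, which is the confound the referee raises.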

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with no derivations or self-referential reductions

This paper introduces the MUSA benchmark and reports empirical results on LALM performance under multilingual distractors. There are no equations, fitted parameters, mathematical derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central findings on accuracy degradation and distractor-grounded errors follow directly from model outputs on the constructed test items, with no self-definitional loops, imported uniqueness theorems, or ansatz smuggling. The evaluation is self-contained against external model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the MUSA benchmark design and the interpretation of model errors as source confusion rather than other factors.

axioms (1)
  • Domain assumption: The MUSA items with controlled SNRs and semantically plausible multilingual distractors provide a valid measure of selective auditory attention.
    The abstract invokes this assumption to justify the evaluation across single, separation-based, and end-to-end settings.
invented entities (1)
  • MUSA benchmark (no independent evidence)
    Purpose: to evaluate selective auditory attention capabilities of LALMs under multilingual interference.
    A newly introduced evaluation framework in this paper.

pith-pipeline@v0.9.0 · 5679 in / 1369 out tokens · 73129 ms · 2026-05-19T23:13:12.660168+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

  1. Toward a theory of situation awareness in dynamic systems. Situational awareness, 2017.
  2. Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America.
  3. Recent advances in speech language models: A survey. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  4. VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models. arXiv preprint arXiv:2501.04962.
  5. AudioTrust: Benchmarking the multifaceted trustworthiness of audio large language models. arXiv preprint arXiv:2505.16211.
  6. Target language extraction at multilingual cocktail parties. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
  7. Deep clustering: Discriminative embeddings for segmentation and separation. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  8. AIR-Bench: Benchmarking large audio-language models via generative comprehension. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  9. LibriMix: An open-source dataset for generalizable speech separation. arXiv preprint arXiv:2005.11262.
  10. A Cocktail-Party Benchmark: Multi-Modal Dataset and Comparative Evaluation Results. ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.
  11. AudioBench: A universal benchmark for audio large language models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
  12. SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information. Proc. Interspeech 2025.
  13. VoiceBench: Benchmarking LLM-based voice assistants. Transactions of the Association for Computational Linguistics, 2026.
  14. CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings. Proc. CHiME 2020.
  15. The cocktail party problem. Current Biology, 2009.
  16. Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models. arXiv preprint arXiv:2505.15406.
  17. The cocktail-party problem revisited: early processing and selection of multi-talker speech. Attention, Perception, & Psychophysics, 2015.
  18. AHELM: A holistic evaluation of audio-language models. arXiv preprint arXiv:2508.21376.
  19. JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models. arXiv preprint arXiv:2505.17568.
  20. Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models (also listed as "Tune in, act up: Exploring the impact of audio modality-specific edits on large audio language models in jailbreak"). arXiv preprint arXiv:2501.13772.
  21. Robust speech recognition via large-scale weak supervision. International Conference on Machine Learning, 2023.
  22. Hearing the Order: Investigating Selection Bias in Large Audio-Language Models. arXiv preprint arXiv:2510.00628.
  23. GPT-4o System Card. arXiv preprint arXiv:2410.21276.
  24. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.
  25. Qwen2-Audio Technical Report. arXiv preprint arXiv:2407.10759.
  26. Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215.
  27. MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models. 2024.
  28. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. arXiv preprint arXiv:2507.08128.
  29. DeSTA2: Developing instruction-following speech language model without speech instruction-tuning data. arXiv preprint arXiv:2409.20007.
  30. ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical Deployment. arXiv preprint arXiv:2506.19398.
  31. The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models. arXiv preprint arXiv:2601.02954.
  32. Linguistic distance: A quantitative measure of the distance between English and other languages. Journal of Multilingual and Multicultural Development, 2005.
  33. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers.
  34. Evaluating robustness of large audio language models to audio injection: An empirical study. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  35. Evaluating the benefit of hearing aids in solving the cocktail party problem. Trends in Amplification, 2008.
  36. Multilingual E5 Text Embeddings: A Technical Report. arXiv preprint arXiv:2402.05672.