arxiv: 2607.02343 · v1 · pith:34LDSXMPnew · submitted 2026-07-02 · 💻 cs.SD · cs.AI

SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios

Pith reviewed 2026-07-03 05:38 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords sound source localization · target sound localization · prompt-guided attention · selective localization · direction of arrival · multichannel audio · deep learning audio

The pith

SelectTSL localizes only a user-specified target sound among multiple sources using prompt guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates prompt-guided selective target sound localization as a task that requires identifying direction only for the prompted source rather than all active sounds. It addresses the gap where standard localization finds everything without selectivity and target extraction loses spatial cues needed for direction estimates. The approach builds an end-to-end model that uses prompt information to steer phase-difference processing toward the target while also counting how many target sources are active. A sympathetic reader would care because this enables systems that respond to one designated sound in crowded or noisy settings instead of processing the entire scene. The design explicitly manages scenes where the number of target sources changes over time.

Core claim

SelectTSL is an end-to-end architecture that localizes only the user-specified target in multi-source acoustic scenes. It employs a Prompt-Guided Selective Attention Module to generate prompt-informed embeddings. These embeddings guide an inter-channel phase difference enhancer to refine raw phase cues, which are then fused with target magnitudes to jointly estimate direction of arrival and target-source cardinality.
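The raw phase cue named in the core claim can be made concrete with a minimal sketch. It computes the inter-channel phase difference (IPD) between two microphone channels from their short-time spectra, plus a cos/sin encoding that is commonly used to avoid the wrap at ±π. The helper names (`stft`, `ipd`, `ipd_features`), the window, and the frame sizes are illustrative assumptions, not details taken from the paper, and the learned enhancer itself is not modeled here.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT of a 1-D signal with a Hann window.

    Returns a complex array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def ipd(ref_spec, other_spec):
    """Inter-channel phase difference in (-pi, pi], per time-frequency bin."""
    return np.angle(other_spec * np.conj(ref_spec))

def ipd_features(ref_spec, other_spec):
    """cos/sin encoding of the IPD, shape (2, frames, bins)."""
    phi = ipd(ref_spec, other_spec)
    return np.stack([np.cos(phi), np.sin(phi)], axis=0)

if __name__ == "__main__":
    # Hypothetical two-microphone example: the second channel is the first
    # one delayed by 2 samples, so the IPD at a tone's frequency bin is a
    # negative phase proportional to that delay.
    fs, n_fft, k, delay = 16000, 512, 32, 2
    t = np.arange(4000)
    f = k * fs / n_fft  # tone centred on STFT bin k
    mic1 = np.sin(2 * np.pi * f * t / fs)
    mic2 = np.sin(2 * np.pi * f * (t - delay) / fs)
    phi = ipd(stft(mic1, n_fft), stft(mic2, n_fft))
    print("IPD at bin", k, "per frame:", np.round(phi[:, k], 3))
```

For a pure delay the IPD grows linearly with frequency, which is why it carries direction information and why any module that refines it has to keep the relation between channels intact.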

What carries the argument

Prompt-Guided Selective Attention Module (PGSA) that produces prompt-informed embeddings to steer the inter-channel phase difference (IPD) enhancer toward target spatial cues while preserving multichannel information.

If this is right

  • Localization occurs selectively for the prompted target and ignores other sources.
  • Direction of arrival and target-source count are estimated together in one model.
  • The system outperforms baselines on both synthetic data and real-world recordings.
  • Robust performance holds across real acoustic environments.
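One way to picture how a direction estimate and a source count combine at inference time is sketched below. It assumes the model emits a frame-wise posterior over azimuth bins (Figure 3 refers to frame-wise DoA posteriors) together with an integer count for the same frame. The function name `decode_frame`, the 1-degree grid, and the peak-picking rule are assumptions made for illustration and are not described in the source.

```python
import numpy as np

def decode_frame(posterior, count):
    """Pick `count` azimuth peaks from a circular DoA posterior.

    posterior: 1-D array over azimuth bins covering 0..360 degrees.
    count: estimated number of active target sources in this frame.
    Returns a sorted list of azimuths in degrees.
    """
    n = len(posterior)
    left = np.roll(posterior, 1)
    right = np.roll(posterior, -1)
    # Circular local maxima: above the left neighbour, not below the right one.
    peaks = np.flatnonzero((posterior > left) & (posterior >= right))
    # Keep only the `count` strongest peaks.
    best = peaks[np.argsort(posterior[peaks])[::-1][:count]]
    return sorted((best * 360.0 / n).tolist())

if __name__ == "__main__":
    # Hypothetical posterior with three bumps of decreasing height.
    az = np.arange(360)
    def bump(center, height, width=5.0):
        d = np.abs(az - center)
        d = np.minimum(d, 360 - d)  # circular distance in degrees
        return height * np.exp(-0.5 * (d / width) ** 2)
    post = bump(10, 1.0) + bump(200, 0.8) + bump(300, 0.5)
    print(decode_frame(post, count=2))
    print(decode_frame(post, count=0))
```

With a count of zero the decoder returns no direction at all, which is the behavior a selective localizer needs when the prompted source is silent.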

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt mechanism could be paired with source separation models to both locate and extract the designated sound.
  • Voice or text prompts might allow consumer devices to focus audio processing on one chosen source in a room.
  • The joint estimation of count and direction opens a route to tracking moving targets whose number changes.

Load-bearing premise

The prompt-guided module can create embeddings that reliably direct the phase enhancer to the correct spatial cues without discarding multichannel details.

What would settle it

The central claim would be falsified by experiments on real recordings in which the method fails to outperform baselines on direction-estimation accuracy, or in which its performance degrades when the number of target sources varies.
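Direction-estimation accuracy is usually scored with an angular error that respects the wrap-around at 360 degrees. The sketch below shows one such score for azimuth only. The function name and the choice of azimuth in degrees are assumptions, and the paper's actual metrics are not given in the material available here.

```python
import numpy as np

def angular_error_deg(pred_deg, true_deg):
    """Absolute azimuth error in degrees, wrapped to [0, 180]."""
    pred = np.asarray(pred_deg, dtype=float)
    true = np.asarray(true_deg, dtype=float)
    diff = (pred - true) % 360.0
    return np.minimum(diff, 360.0 - diff)

if __name__ == "__main__":
    # 350 degrees and 10 degrees are 20 degrees apart, not 340.
    print(angular_error_deg(350, 10))
    print(angular_error_deg([0, 90], [180, 100]).mean())
```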

Figures

Figures reproduced from arXiv: 2607.02343 by Bowen Xing, Haizhou Li, Ivor W. Tsang, Xinyuan Qian, Xu-Cheng Yin, Yu Chen, Zexu Pan, Ziyang Jiang.

Figure 1. Illustration of conventional semantic-blind SSL and our proposed …
Figure 2. Overall architecture of SelectTSL. The PGSA module outputs the target magnitude spectrogram and a stack of EIEs …
Figure 3. Heatmap visualization of frame-wise DoA posteriors from four separately trained models under different prompt configurations. For three representative …
Figure 4. Polar visualization of temporal-spatial DoA trajectories. The radial axis encodes time (0–75 frames) and the angular axis encodes DoA ( …
Original abstract

Humans can selectively attend to a target sound and estimate its direction in complex scenarios, whereas such selective localization remains challenging for current deep learning-based systems. Sound source localization (SSL) has achieved remarkable success with deep learning, yet most methods localize all active sources without selectivity. Conversely, target sound extraction (TSE) extracts sources using multimodal prompts but typically fails to preserve the multichannel spatial information required for accurate localization. To bridge this gap, we formulate the task of prompt-guided selective target sound localization and propose SelectTSL, an end-to-end architecture that localizes only the user-specified target in multi-source acoustic scenes. Specifically, we design a target-aware selective localization strategy that employs a Prompt-Guided Selective Attention Module (PGSA) to generate prompt-informed embeddings. These embeddings guide an inter-channel phase difference (IPD) enhancer to refine raw phase cues, fusing with target magnitudes to jointly estimate direction of arrival (DoA) and target-source cardinality, i.e., the number of target sound sources. This coupled design effectively focuses on the user-specified target spatial cues for selective localization and also handles time-varying numbers of target sources. Extensive experiments on both synthetic data and real-world recordings demonstrate that our proposed method consistently outperforms other baselines and exhibits robust generalization to real acoustic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes SelectTSL, an end-to-end architecture for prompt-guided selective target sound localization in multi-source scenes. It introduces a Prompt-Guided Selective Attention Module (PGSA) to produce prompt-informed embeddings that guide an inter-channel phase difference (IPD) enhancer; the refined phase cues are fused with target magnitudes to jointly estimate direction of arrival (DoA) and target-source cardinality. Experiments on synthetic data and real-world recordings are reported to show consistent outperformance over baselines together with robust generalization to real acoustic environments.

Significance. If the experimental support holds, the work usefully bridges target sound extraction and sound source localization by adding prompt-based selectivity while preserving the spatial cues needed for DoA. The joint estimation of DoA and time-varying source cardinality is a constructive design choice. No machine-checked proofs or open reproducible code are mentioned.

major comments (1)
  1. [Abstract] Abstract (architecture paragraph): the claim that PGSA embeddings 'guide' the IPD enhancer and are 'fused with target magnitudes' to refine raw phase cues lacks an explicit mechanism (additive residual, channel-wise gating, or complex-valued attention). Without this, it is impossible to verify that inter-channel phase differences survive the operation, which is load-bearing for both the selectivity claim and the reported DoA accuracy on real recordings.
minor comments (1)
  1. The abstract states outperformance on synthetic and real data but supplies no quantitative metrics, error bars, baseline descriptions, or ablation results; these must appear with full experimental detail in the main text.
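The major comment's concern about whether inter-channel phase differences survive the guidance step can be illustrated with a small numerical check. The sketch compares two of the candidate mechanisms the referee names, applied to a complex cross-spectrum whose angle is the IPD: a strictly positive real-valued gate and an additive complex residual. All arrays are random stand-ins and the shapes are hypothetical. Nothing here reflects the mechanism the authors actually use, which is not stated in the abstract.

```python
import numpy as np

def phase_change(before, after):
    """Absolute phase rotation (radians) between two complex arrays, per bin."""
    return np.abs(np.angle(after * np.conj(before)))

rng = np.random.default_rng(0)
shape = (4, 8)  # hypothetical (frames, frequency bins)

# Stand-in cross-spectrum between two channels; its angle is the IPD.
cross = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Candidate 1: strictly positive real gate (rescales magnitude only).
gate = rng.uniform(0.1, 1.0, size=shape)
gated = gate * cross

# Candidate 2: additive complex residual (can rotate the phase).
residual = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
added = cross + residual

print("max phase change, gating:  ", phase_change(cross, gated).max())
print("max phase change, additive:", phase_change(cross, added).max())
```

A positive real gate cannot rotate the phase, so the IPD is preserved by construction. An additive residual carries no such guarantee and would need to be learned to refine the phase rather than corrupt it, which is why the referee asks for the mechanism to be stated.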

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and agree that greater clarity is warranted.

  1. Referee: [Abstract] Abstract (architecture paragraph): the claim that PGSA embeddings 'guide' the IPD enhancer and are 'fused with target magnitudes' to refine raw phase cues lacks an explicit mechanism (additive residual, channel-wise gating, or complex-valued attention). Without this, it is impossible to verify that inter-channel phase differences survive the operation, which is load-bearing for both the selectivity claim and the reported DoA accuracy on real recordings.

    Authors: We agree that the abstract, as written, provides only a high-level description and does not specify the precise mechanism (e.g., additive residual, channel-wise gating, or complex-valued attention) by which PGSA embeddings guide the IPD enhancer or how fusion with target magnitudes occurs. This omission makes it difficult to confirm preservation of inter-channel phase differences from the abstract alone. In the revised manuscript we will update the abstract paragraph to include a concise statement of the actual mechanism used in the architecture while preserving its summary character; the detailed implementation remains in Section 3.

    revision: yes

Circularity Check

0 steps flagged

No circularity: end-to-end learned architecture with external evaluation

The paper presents SelectTSL as an end-to-end neural architecture whose central claims rest on experimental outperformance on synthetic and real recordings. No equations, predictions, or uniqueness theorems are shown that reduce by construction to fitted inputs or self-citations; the PGSA and IPD enhancer are described as learned components whose behavior is validated externally rather than defined tautologically. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or architectural hyperparameters are provided, so free parameters, axioms, and invented entities cannot be identified.

pith-pipeline@v0.9.1-grok · 5783 in / 1129 out tokens · 19247 ms · 2026-07-03T05:38:48.368579+00:00 · methodology
