SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios

Bowen Xing; Haizhou Li; Ivor W. Tsang; Xinyuan Qian; Xu-Cheng Yin; Yu Chen; Zexu Pan; Ziyang Jiang

arxiv: 2607.02343 · v1 · pith:34LDSXMPnew · submitted 2026-07-02 · 💻 cs.SD · cs.AI

SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios

Ziyang Jiang , Yu Chen , Zexu Pan , Xinyuan Qian , Bowen Xing , Ivor W. Tsang , Xu-Cheng Yin , Haizhou Li This is my paper

Pith reviewed 2026-07-03 05:38 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords sound source localizationtarget sound localizationprompt-guided attentionselective localizationdirection of arrivalmultichannel audiodeep learning audio

0 comments

The pith

SelectTSL localizes only a user-specified target sound among multiple sources using prompt guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates prompt-guided selective target sound localization as a task that requires identifying direction only for the prompted source rather than all active sounds. It addresses the gap where standard localization finds everything without selectivity and target extraction loses spatial cues needed for direction estimates. The approach builds an end-to-end model that uses prompt information to steer phase-difference processing toward the target while also counting how many target sources are active. A sympathetic reader would care because this enables systems that respond to one designated sound in crowded or noisy settings instead of processing the entire scene. The design explicitly manages scenes where the number of target sources changes over time.

Core claim

SelectTSL is an end-to-end architecture that localizes only the user-specified target in multi-source acoustic scenes. It employs a Prompt-Guided Selective Attention Module to generate prompt-informed embeddings. These embeddings guide an inter-channel phase difference enhancer to refine raw phase cues, which are then fused with target magnitudes to jointly estimate direction of arrival and target-source cardinality.

What carries the argument

Prompt-Guided Selective Attention Module (PGSA) that produces prompt-informed embeddings to steer the IPD enhancer toward target spatial cues while preserving multichannel information.

If this is right

Localization occurs selectively for the prompted target and ignores other sources.
Direction of arrival and target-source count are estimated together in one model.
The system outperforms baselines on both synthetic data and real-world recordings.
Robust performance holds across real acoustic environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt mechanism could be paired with source separation models to both locate and extract the designated sound.
Voice or text prompts might allow consumer devices to focus audio processing on one chosen source in a room.
The joint estimation of count and direction opens a route to tracking moving targets whose number changes.

Load-bearing premise

The prompt-guided module can create embeddings that reliably direct the phase enhancer to the correct spatial cues without discarding multichannel details.

What would settle it

Experiments on real recordings where the method fails to outperform baselines on direction estimation accuracy or loses performance when the number of target sources varies would falsify the central claim.

Figures

Figures reproduced from arXiv: 2607.02343 by Bowen Xing, Haizhou Li, Ivor W. Tsang, Xinyuan Qian, Xu-Cheng Yin, Yu Chen, Zexu Pan, Ziyang Jiang.

**Figure 2.** Figure 2: Overall architecture of SelectTSL. The PGSA module outputs the target magnitude spectrogram and a stack of EIEs [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Heatmap visualization of frame-wise DoA posteriors from four separately trained models under different prompt configurations. For three representative [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Polar visualization of temporal-spatial DoA trajectories. The radial axis encodes time (0–75 frames) and the angular axis encodes DoA ( [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

read the original abstract

Humans can selectively attend to a target sound and estimate its direction in complex scenarios, whereas such selective localization remains challenging for current deep learning-based systems. Sound source localization (SSL) has achieved remarkable success with deep learning, yet most methods localize all active sources without selectivity. Conversely, target sound extraction (TSE) extracts sources using multimodal prompts but typically fails to preserve the multichannel spatial information required for accurate localization. To bridge this gap, we formulate the task of prompt-guided selective target sound localization and propose SelectTSL, an end-to-end architecture that localizes only the user-specified target in multi-source acoustic scenes. Specifically, we design a target-aware selective localization strategy that employs a Prompt-Guided Selective Attention Module (PGSA) to generate prompt-informed embeddings. These embeddings guide an inter-channel phase difference (IPD) enhancer to refine raw phase cues, fusing with target magnitudes to jointly estimate direction of arrival (DoA) and target-source cardinality, i.e., the number of target sound sources. This coupled design effectively focuses on the user-specified target spatial cues for selective localization and also handles time-varying numbers of target sources. Extensive experiments on both synthetic data and real-world recordings demonstrate that our proposed method consistently outperforms other baselines and exhibits robust generalization to real acoustic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a new task of prompt-guided selective target sound localization and proposes a PGSA-IPD architecture to address it, but the abstract gives no metrics or fusion details so the performance claims are hard to assess.

read the letter

The core contribution is framing selective localization as a prompt-driven task that sits between standard SSL and TSE. They introduce PGSA to create prompt-informed embeddings that steer an IPD enhancer, then fuse those with target magnitudes to output both DoA and the count of active targets. That coupling is the actual novelty; prior work either localizes everything or extracts without spatial output.

The design choice to handle variable target cardinality in one forward pass is practical and avoids separate detection stages. The claim of generalization to real recordings is the part worth checking first.

The stress-test note on phase preservation is worth taking seriously. The abstract says the embeddings "guide" the IPD enhancer and are "fused with target magnitudes," but does not spell out whether this is additive, gating, or some projection that could dilute inter-channel differences. If the guidance step collapses phase information, the downstream DoA estimate would suffer exactly where selectivity matters most.

No numbers, baselines, or ablations appear in the abstract, so the "consistent outperformance" statement cannot be evaluated yet. The architecture description is clear enough on paper, but the evidence gap is the main soft spot.

This is for groups already working on multimodal audio or spatial sound processing. It is not a foundational shift, but the task definition plus the coupled module is a reasonable incremental step. I would send it to review; the idea is coherent and the real-data claim is testable, even if the current write-up leaves the fusion mechanics and quantitative support thin.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes SelectTSL, an end-to-end architecture for prompt-guided selective target sound localization in multi-source scenes. It introduces a Prompt-Guided Selective Attention Module (PGSA) to produce prompt-informed embeddings that guide an inter-channel phase difference (IPD) enhancer; the refined phase cues are fused with target magnitudes to jointly estimate direction of arrival (DoA) and target-source cardinality. Experiments on synthetic data and real-world recordings are reported to show consistent outperformance over baselines together with robust generalization to real acoustic environments.

Significance. If the experimental support holds, the work usefully bridges target sound extraction and sound source localization by adding prompt-based selectivity while preserving the spatial cues needed for DoA. The joint estimation of DoA and time-varying source cardinality is a constructive design choice. No machine-checked proofs or open reproducible code are mentioned.

major comments (1)

[Abstract] Abstract (architecture paragraph): the claim that PGSA embeddings 'guide' the IPD enhancer and are 'fused with target magnitudes' to refine raw phase cues lacks an explicit mechanism (additive residual, channel-wise gating, or complex-valued attention). Without this, it is impossible to verify that inter-channel phase differences survive the operation, which is load-bearing for both the selectivity claim and the reported DoA accuracy on real recordings.

minor comments (1)

The abstract states outperformance on synthetic and real data but supplies no quantitative metrics, error bars, baseline descriptions, or ablation results; these must appear with full experimental detail in the main text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and agree that greater clarity is warranted.

read point-by-point responses

Referee: [Abstract] Abstract (architecture paragraph): the claim that PGSA embeddings 'guide' the IPD enhancer and are 'fused with target magnitudes' to refine raw phase cues lacks an explicit mechanism (additive residual, channel-wise gating, or complex-valued attention). Without this, it is impossible to verify that inter-channel phase differences survive the operation, which is load-bearing for both the selectivity claim and the reported DoA accuracy on real recordings.

Authors: We agree that the abstract, as written, provides only a high-level description and does not specify the precise mechanism (e.g., additive residual, channel-wise gating, or complex-valued attention) by which PGSA embeddings guide the IPD enhancer or how fusion with target magnitudes occurs. This omission makes it difficult to confirm preservation of inter-channel phase differences from the abstract alone. In the revised manuscript we will update the abstract paragraph to include a concise statement of the actual mechanism used in the architecture while preserving its summary character; the detailed implementation remains in Section 3. revision: yes

Circularity Check

0 steps flagged

No circularity: end-to-end learned architecture with external evaluation

full rationale

The paper presents SelectTSL as an end-to-end neural architecture whose central claims rest on experimental outperformance on synthetic and real recordings. No equations, predictions, or uniqueness theorems are shown that reduce by construction to fitted inputs or self-citations; the PGSA and IPD enhancer are described as learned components whose behavior is validated externally rather than defined tautologically. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or architectural hyperparameters are provided, so free parameters, axioms, and invented entities cannot be identified.

pith-pipeline@v0.9.1-grok · 5783 in / 1129 out tokens · 19247 ms · 2026-07-03T05:38:48.368579+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

79 extracted references · 16 canonical work pages · 4 internal anchors

[1]

Benesty, J

J. Benesty, J. Chen, and Y . Huang,Microphone array signal processing. Springer, 2008. 14

2008
[2]

H ¨eb-Umbach, T

R. H ¨eb-Umbach, T. Nakatani, M. Delcroix, C. Boeddeker, and T. Ochiai, “Microphone array signal processing and deep learning for speech enhancement: Combining model-based and data-driven approaches to parameter estimation and filtering [special issue on model-based and data-driven audio signal processing],”IEEE Signal Processing Maga- zine, vol. 41, no. ...

2025
[3]

The generalized correlation method for estimation of time delay,

C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 320–327, 2003

2003
[4]

Multiple emitter location and signal parameter estimation,

R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276– 280, 1986

1986
[5]

Multi-speaker doa estimation using deep convolutional networks trained with noise signals,

S. Chakrabarty and E. A. Habets, “Multi-speaker doa estimation using deep convolutional networks trained with noise signals,”IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 8–21, 2019

2019
[6]

Some experiments on the recognition of speech, with one and with two ears,

E. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,”J. Acoust. Soc. Am., vol. 25, no. 5, pp. 975–979, 1953

1953
[7]

Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,

S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,”IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 34–48, 2018

2018
[8]

Overview and evaluation of sound event localization and detection in dcase 2019,

A. Politis, A. Mesaros, S. Adavanne, T. Heittola, and T. Virtanen, “Overview and evaluation of sound event localization and detection in dcase 2019,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 29, pp. 684–698, 2020

2019
[9]

Toward interactive sound source localization: Better align sight and sound!

A. Senocak, H. Ryu, J. Kim, T.-H. Oh, H. Pfister, and J. S. Chung, “Toward interactive sound source localization: Better align sight and sound!”IEEE Trans. Pattern Anal. Mach. Intell., 2025

2025
[10]

V oiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,

Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y . Jia, and I. L. Moreno, “V oiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,” INTERSPEECH, pp. 2728–2732, 2019

2019
[11]

Separate what you describe: Language-queried audio source separation,

X. Liu, H. Liu, Q. Kong, X. Mei, J. Zhao, Q. Huang, M. D. Plumbley, and W. Wang, “Separate what you describe: Language-queried audio source separation,” inINTERSPEECH, 2022, pp. 1801–1805

2022
[12]

Leveraging audio-only data for text-queried target sound extraction,

K. Saijo, J. Ebbers, F. G. Germain, S. Khurana, G. Wiehern, and J. Le Roux, “Leveraging audio-only data for text-queried target sound extraction,” inICASSP. IEEE, 2025, pp. 1–5

2025
[13]

Contextual speech extraction: Leveraging textual history as an implicit cue for target speech extraction,

M. Kim, R. Mira, H. Chen, S. Petridis, and M. Pantic, “Contextual speech extraction: Leveraging textual history as an implicit cue for target speech extraction,” inICASSP. IEEE, 2025, pp. 1–5

2025
[14]

Deep audio-visual speech recognition,

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,”IEEE Trans. Pattern Anal. Mach. Intell., 2018

2018
[15]

An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits,

K. Li, F. Xie, H. Chen, K. Yuan, and X. Hu, “An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 10, pp. 6637–6651, 2024

2024
[16]

Usev: Universal speaker extraction with visual cue,

Z. Pan, M. Ge, and H. Li, “Usev: Universal speaker extraction with visual cue,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 30, pp. 3032–3045, 2022

2022
[17]

Audio-visual target speaker extraction with reverse selective auditory attention,

R. Tao, X. Qian, Y . Jiang, J. Li, J. Wang, and H. Li, “Audio-visual target speaker extraction with reverse selective auditory attention,”IEEE Trans. on Audio, Speech, and Language Processing, 2025

2025
[18]

Neural spatial filter: Target speaker speech separation assisted with directional information

R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y . Xu, M. Yu, D. Su, Y . Zou, and D. Yu, “Neural spatial filter: Target speaker speech separation assisted with directional information.” inINTERSPEECH, 2019, pp. 4290–4294

2019
[19]

Towards unified all-neural beamforming for time and frequency domain speech separation,

R. Gu, S.-X. Zhang, Y . Zou, and D. Yu, “Towards unified all-neural beamforming for time and frequency domain speech separation,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 31, pp. 849– 862, 2022

2022
[20]

Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,

K. ˇZmol´ıkov´a, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Bur- get, and J. ˇCernock`y, “Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,”IEEE J. Sel. Topics Signal Process., vol. 13, no. 4, pp. 800–814, 2019

2019
[21]

Spex: Multi-scale time domain speaker extraction network,

C. Xu, W. Rao, E. S. Chng, and H. Li, “Spex: Multi-scale time domain speaker extraction network,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 28, pp. 1370–1384, 2020

2020
[22]

X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,

K. Liu, Z. Du, X. Wan, and H. Zhou, “X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,” in ICASSP. IEEE, 2023, pp. 1–5

2023
[23]

Clipsep: Learning text-queried sound separation with noisy unlabeled videos,

H.-W. Dong, N. Takahashi, Y . Mitsufuji, J. McAuley, and T. Berg- Kirkpatrick, “Clipsep: Learning text-queried sound separation with noisy unlabeled videos,”arXiv preprint arXiv:2212.07065, 2022

work page arXiv 2022
[24]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inICASSP. IEEE, 2023, pp. 1–5

2023
[25]

Clapsep: Lever- aging contrastive pre-trained model for multi-modal query-conditioned target sound extraction,

H. Ma, Z. Peng, X. Li, M. Shao, X. Wu, and J. Liu, “Clapsep: Lever- aging contrastive pre-trained model for multi-modal query-conditioned target sound extraction,”IEEE Trans. on Audio, Speech, and Language Processing, 2024

2024
[26]

Separate anything you describe,

X. Liu, Q. Kong, Y . Zhao, H. Liu, Y . Yuan, Y . Liu, R. Xia, Y . Wang, M. D. Plumbley, and W. Wang, “Separate anything you describe,”IEEE Trans. on Audio, Speech, and Language Processing, 2024

2024
[27]

Zero-shot audio source separation through query-based learning from weakly-labeled data,

K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Zero-shot audio source separation through query-based learning from weakly-labeled data,” inAAAI, vol. 36, no. 4, 2022, pp. 4441–4449

2022
[28]

Conv-TasNet: Surpassing ideal time– frequency magnitude masking for speech separation,

Y . Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time– frequency magnitude masking for speech separation,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019

2019
[29]

Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,

Y . Luo, Z. Chen, and T. Yoshioka, “Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,” inICASSP. IEEE, 2020, pp. 46–50

2020
[30]

Tf-gridnet: Integrating full-and sub-band modeling for speech separa- tion,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watanabe, “Tf-gridnet: Integrating full-and sub-band modeling for speech separa- tion,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 31, pp. 3221–3236, 2023

2023
[31]

A survey of sound source localization with deep learning methods,

P.-A. Grumiaux, S. Kiti ´c, L. Girin, and A. Gu ´erin, “A survey of sound source localization with deep learning methods,”J. Acoust. Soc. Am., vol. 152, no. 1, pp. 107–151, 2022

2022
[32]

Learning to localize sound sources in visual scenes: Analysis and applications,

A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon, “Learning to localize sound sources in visual scenes: Analysis and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1605–1619, 2019

2019
[33]

Robust audio-visual contrastive learning for proposal-based self-supervised sound source localization in videos,

H. Xuan, Z. Wu, J. Yang, B. Jiang, L. Luo, X. Alameda-Pineda, and Y . Yan, “Robust audio-visual contrastive learning for proposal-based self-supervised sound source localization in videos,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 7, pp. 4896–4907, 2024

2024
[34]

Enhancing sound source localization via false negative elimination,

Z. Song, J. Zhang, Y . Wang, J. Fan, and Z. Zhang, “Enhancing sound source localization via false negative elimination,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 10 499–10 514, 2024

2024
[35]

Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,

T. N. T. Nguyen, K. N. Watcharasupat, N. K. Nguyen, D. L. Jones, and W.-S. Gan, “Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 30, pp. 1749–1762, 2022

2022
[36]

Fn-ssl: Full-band and narrow-band fusion for sound source localization,

Y . Wang, B. Yang, and X. Li, “Fn-ssl: Full-band and narrow-band fusion for sound source localization,” inINTERSPEECH, 2023, pp. 3779–3783

2023
[37]

Ipdnet: A universal direct-path ipd estimation network for sound source localization,

——, “Ipdnet: A universal direct-path ipd estimation network for sound source localization,”IEEE Trans. on Audio, Speech, and Language Processing, 2024

2024
[38]

Salad- net: Self-attentive multisource localization in the ambisonics domain,

P.-A. Grumiaux, S. Kiti ´c, P. Srivastava, L. Girin, and A. Gu´erin, “Salad- net: Self-attentive multisource localization in the ambisonics domain,” inIEEE Workshop Appl. Signal Process. Audio Acoust.IEEE, 2021, pp. 336–340

2021
[39]

Swg-former: A sliding- window graph convolutional network for simultaneous spatial-temporal information extraction in sound event localization and detection,

W. Huang, Q. Huang, L. Ma, and C. Wang, “Swg-former: A sliding- window graph convolutional network for simultaneous spatial-temporal information extraction in sound event localization and detection,”arXiv preprint arXiv:2310.14016, 2023

work page arXiv 2023
[40]

Sound event localization and detection for real spatial sound scenes: Event-independent network and data augmentation chains,

J. Hu, Y . Cao, M. Wu, Q. Kong, F. Yang, M. D. Plumbley, and J. Yang, “Sound event localization and detection for real spatial sound scenes: Event-independent network and data augmentation chains,” arXiv preprint arXiv:2209.01802, 2022

work page arXiv 2022
[41]

Cst-former: Transformer with channel-spectro- temporal attention for sound event localization and detection,

Y . Shul and J.-W. Choi, “Cst-former: Transformer with channel-spectro- temporal attention for sound event localization and detection,” in ICASSP. IEEE, 2024, pp. 8686–8690

2024
[42]

Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,

T. N. T. Nguyen, D. L. Jones, K. N. Watcharasupat, H. Phan, and W.-S. Gan, “Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” inICASSP. IEEE, 2022, pp. 716–720

2022
[43]

The lu system for dcase 2024 sound event localization and detection challenge,

A. Berg, J. Engman, J. Gulin, K. Astr ¨om, M. Oskarsson, and B. Sony Eu- rope, “The lu system for dcase 2024 sound event localization and detection challenge,” DCASE2024 Challenge, Tech. Rep, Tech. Rep., 2024

2024
[44]

Learning multi-target tdoa features for sound event localization and detection,

A. Berg, J. Engman, J. Gulin, K. ˚Astr¨om, and M. Oskarsson, “Learning multi-target tdoa features for sound event localization and detection,” arXiv preprint arXiv:2408.17166, 2024

work page arXiv 2024
[45]

Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization,

B. Yang, H. Liu, and X. Li, “Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization,” inICASSP. IEEE, 2022, pp. 721–725

2022
[46]

Mimo-doanet: Multi-channel input and multiple outputs doa 15 network with unknown number of sound sources,

H. Yin, M. Ge, Y . Fu, G. Zhang, L. Wang, L. Zhang, L. Qiu, and J. Dang, “Mimo-doanet: Multi-channel input and multiple outputs doa 15 network with unknown number of sound sources,”arXiv preprint arXiv:2207.07307, 2022

work page arXiv 2022
[47]

Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection,

K. Shimada, Y . Koyama, N. Takahashi, S. Takahashi, and Y . Mitsufuji, “Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection,” inICASSP. IEEE, 2021, pp. 915–919

2021
[48]

Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,

K. Shimada, Y . Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, and Y . Mitsufuji, “Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” inICASSP. IEEE, 2022, pp. 316–320

2022
[49]

Event-independent network for polyphonic sound event localization and detection,

Y . Cao, T. Iqbal, Q. Kong, Y . Zhong, W. Wang, and M. D. Plumbley, “Event-independent network for polyphonic sound event localization and detection,”arXiv preprint arXiv:2010.00140, 2020

work page arXiv 2010
[50]

An improved event-independent network for polyphonic sound event localization and detection,

Y . Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley, “An improved event-independent network for polyphonic sound event localization and detection,” inICASSP. IEEE, 2021, pp. 885–889

2021
[51]

Mff-einv2: Multi-scale feature fusion across spectral-spatial-temporal domains for sound event localization and detection,

D. Mu, Z. Zhang, and H. Yue, “Mff-einv2: Multi-scale feature fusion across spectral-spatial-temporal domains for sound event localization and detection,”arXiv preprint arXiv:2406.08771, 2024

work page arXiv 2024
[52]

Stereo sound event localization and detection with onscreen/offscreen classification,

K. Shimada, A. Politis, I. R. Roman, P. Sudarsanam, D. Diaz-Guerra, R. Pandey, K. Uchida, Y . Koyama, N. Takahashi, T. Shibuyaet al., “Stereo sound event localization and detection with onscreen/offscreen classification,”arXiv preprint arXiv:2507.12042, 2025

work page arXiv 2025
[53]

Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

S. Adavanne, A. Politis, and T. Virtanen, “Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network,”arXiv preprint arXiv:1904.12769, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[54]

Differentiable tracking-based training of deep learning sound source localizers,

——, “Differentiable tracking-based training of deep learning sound source localizers,” inIEEE Workshop Appl. Signal Process. Audio Acoust.IEEE, 2021, pp. 211–215

2021
[55]

Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment,

S. Sivasankaran, E. Vincent, and D. Fohr, “Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment,” inINTERSPEECH, 2018

2018
[56]

Locate this, not that: Class-conditioned sound event doa estimation,

O. Slizovskaia, G. Wichern, Z.-Q. Wang, and J. Le Roux, “Locate this, not that: Class-conditioned sound event doa estimation,” inICASSP. IEEE, 2022, pp. 711–715

2022
[57]

Text-queried target sound event localization,

J. Zhao, X. Qian, Y . Xu, H. Liu, Y . Cao, D. Berghi, and W. Wang, “Text-queried target sound event localization,” inEUSIPCO. IEEE, 2024, pp. 261–265

2024
[58]

Locselect: Target speaker localization with an auditory selective hearing mechanism,

Y . Chen, X. Qian, Z. Pan, K. Chen, and H. Li, “Locselect: Target speaker localization with an auditory selective hearing mechanism,” inICASSP. IEEE, 2024, pp. 8696–8700

2024
[59]

Gcc-speaker: Target speaker localization with optimal speaker-dependent weighting in multi-speaker scenarios,

G. Li, W. Xue, W. Liu, J. Yi, and J. Tao, “Gcc-speaker: Target speaker localization with optimal speaker-dependent weighting in multi-speaker scenarios,” inICASSP. IEEE, 2023, pp. 1–5

2023
[60]

Librispeech: an asr corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” inICASSP. IEEE, 2015, pp. 5206–5210

2015
[61]

A holistic evaluation of piano sound quality,

M. Zhou, S. Wu, S. Ji, Z. Li, and W. Li, “A holistic evaluation of piano sound quality,” inNational Conf. on Sound and Music Technology. Springer, 2023, pp. 3–17

2023
[62]

Guitarset: A dataset for guitar transcription

Q. Xi, R. M. Bittner, J. Pauwels, X. Ye, and J. P. Bello, “Guitarset: A dataset for guitar transcription.” inISMIR, 2018, pp. 453–460

2018
[63]

Audio set: An ontology and human-labeled dataset for audio events,

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” inICASSP. IEEE, 2017, pp. 776–780

2017
[64]

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,

X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 32, pp. 3339– 3354, 2024

2024
[65]

The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,

C. K. Reddy, V . Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braunet al., “The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,”arXiv preprint arXiv:2005.13981, 2020

work page arXiv 2020
[66]

WHAM!: Extending Speech Separation to Noisy Environments

G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “Wham!: Extending speech separation to noisy environments,”arXiv preprint arXiv:1907.01160, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[67]

ESC: Dataset for Environmental Sound Classification,

K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in ACM Int. Conf. Multimedia. ACM Press, pp. 1015–1018
[68]

A dataset and taxonomy for urban sound research,

J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” inACM Int. Conf. Multimedia, Orlando, FL, USA, Nov. 2014, pp. 1041–1044

2014
[69]

The qut-noise- timit corpus for evaluation of voice activity detection algorithms,

D. Dean, S. Sridharan, R. V ogt, and M. Mason, “The qut-noise- timit corpus for evaluation of voice activity detection algorithms,” in INTERSPEECH. International Speech Communication Association, 2010, pp. 3110–3113

2010
[70]

MUSAN: A Music, Speech, and Noise Corpus

D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,”arXiv preprint arXiv:1510.08484, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[71]

gpurir: A python library for room impulse response simulation with gpu acceleration,

D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpurir: A python library for room impulse response simulation with gpu acceleration,” Multimedia Tools and Applications, vol. 80, no. 4, pp. 5653–5671, 2021

2021
[72]

A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,

A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,”arXiv preprint arXiv:2006.01919, 2020

work page arXiv 2006
[73]

TAU Spatial Room Impulse Response Database (TAU-SRIR DB),

A. Politis, S. Adavanne, T. Virtanen, E. Fagerlund, A. Koskimies, A. Hakala, and A. Gohar, “TAU Spatial Room Impulse Response Database (TAU-SRIR DB),” Apr. 2022

2022
[74]

Zero-and few-shot sound event localization and detection,

K. Shimada, K. Uchida, Y . Koyama, T. Shibuya, S. Takahashi, Y . Mit- sufuji, and T. Kawahara, “Zero-and few-shot sound event localization and detection,” inICASSP. IEEE, 2024, pp. 636–640

2024
[75]

Two vs. four-channel sound event localization and detection,

J. Wilkins, M. Fuentes, L. Bondi, S. Ghaffarzadegan, A. Abavisani, and J. P. Bello, “Two vs. four-channel sound event localization and detection,”arXiv preprint arXiv:2309.13343, 2023

work page arXiv 2023
[76]

Data augmentation and class-based ensembled cnn-conformer networks for sound event localization and detection,

Y . Zhang, S. Wang, Z. Li, K. Guo, S. Chen, and Y . Pang, “Data augmentation and class-based ensembled cnn-conformer networks for sound event localization and detection,”Proc. DCASE, vol. 2021, 2021

2021
[77]

Starss22: A dataset of spatial recordings of real scenes with spatiotem- poral annotations of sound events,

A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y . Koyama, N. Takahashi, S. Takahashi, Y . Mitsufuji, and T. Virtanen, “Starss22: A dataset of spatial recordings of real scenes with spatiotem- poral annotations of sound events,”arXiv preprint arXiv:2206.01948, 2022

work page arXiv 2022
[78]

A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms,

´A. F. Garc´ıa-Fern´andez, A. S. Rahmathullah, and L. Svensson, “A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms,”IEEE Trans. on Signal Processing, vol. 68, pp. 3917–3928, 2020

2020
[79]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W....

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Benesty, J

J. Benesty, J. Chen, and Y . Huang,Microphone array signal processing. Springer, 2008. 14

2008

[2] [2]

H ¨eb-Umbach, T

R. H ¨eb-Umbach, T. Nakatani, M. Delcroix, C. Boeddeker, and T. Ochiai, “Microphone array signal processing and deep learning for speech enhancement: Combining model-based and data-driven approaches to parameter estimation and filtering [special issue on model-based and data-driven audio signal processing],”IEEE Signal Processing Maga- zine, vol. 41, no. ...

2025

[3] [3]

The generalized correlation method for estimation of time delay,

C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 320–327, 2003

2003

[4] [4]

Multiple emitter location and signal parameter estimation,

R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276– 280, 1986

1986

[5] [5]

Multi-speaker doa estimation using deep convolutional networks trained with noise signals,

S. Chakrabarty and E. A. Habets, “Multi-speaker doa estimation using deep convolutional networks trained with noise signals,”IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 8–21, 2019

2019

[6] [6]

Some experiments on the recognition of speech, with one and with two ears,

E. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,”J. Acoust. Soc. Am., vol. 25, no. 5, pp. 975–979, 1953

1953

[7] [7]

Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,

S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,”IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 34–48, 2018

2018

[8] [8]

Overview and evaluation of sound event localization and detection in dcase 2019,

A. Politis, A. Mesaros, S. Adavanne, T. Heittola, and T. Virtanen, “Overview and evaluation of sound event localization and detection in dcase 2019,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 29, pp. 684–698, 2020

2019

[9] [9]

Toward interactive sound source localization: Better align sight and sound!

A. Senocak, H. Ryu, J. Kim, T.-H. Oh, H. Pfister, and J. S. Chung, “Toward interactive sound source localization: Better align sight and sound!”IEEE Trans. Pattern Anal. Mach. Intell., 2025

2025

[10] [10]

V oiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,

Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y . Jia, and I. L. Moreno, “V oiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,” INTERSPEECH, pp. 2728–2732, 2019

2019

[11] [11]

Separate what you describe: Language-queried audio source separation,

X. Liu, H. Liu, Q. Kong, X. Mei, J. Zhao, Q. Huang, M. D. Plumbley, and W. Wang, “Separate what you describe: Language-queried audio source separation,” inINTERSPEECH, 2022, pp. 1801–1805

2022

[12] [12]

Leveraging audio-only data for text-queried target sound extraction,

K. Saijo, J. Ebbers, F. G. Germain, S. Khurana, G. Wiehern, and J. Le Roux, “Leveraging audio-only data for text-queried target sound extraction,” inICASSP. IEEE, 2025, pp. 1–5

2025

[13] [13]

Contextual speech extraction: Leveraging textual history as an implicit cue for target speech extraction,

M. Kim, R. Mira, H. Chen, S. Petridis, and M. Pantic, “Contextual speech extraction: Leveraging textual history as an implicit cue for target speech extraction,” inICASSP. IEEE, 2025, pp. 1–5

2025

[14] [14]

Deep audio-visual speech recognition,

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,”IEEE Trans. Pattern Anal. Mach. Intell., 2018

2018

[15] [15]

An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits,

K. Li, F. Xie, H. Chen, K. Yuan, and X. Hu, “An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 10, pp. 6637–6651, 2024

2024

[16] [16]

Usev: Universal speaker extraction with visual cue,

Z. Pan, M. Ge, and H. Li, “Usev: Universal speaker extraction with visual cue,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 30, pp. 3032–3045, 2022

2022

[17] [17]

Audio-visual target speaker extraction with reverse selective auditory attention,

R. Tao, X. Qian, Y . Jiang, J. Li, J. Wang, and H. Li, “Audio-visual target speaker extraction with reverse selective auditory attention,”IEEE Trans. on Audio, Speech, and Language Processing, 2025

2025

[18] [18]

Neural spatial filter: Target speaker speech separation assisted with directional information

R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y . Xu, M. Yu, D. Su, Y . Zou, and D. Yu, “Neural spatial filter: Target speaker speech separation assisted with directional information.” inINTERSPEECH, 2019, pp. 4290–4294

2019

[19] [19]

Towards unified all-neural beamforming for time and frequency domain speech separation,

R. Gu, S.-X. Zhang, Y . Zou, and D. Yu, “Towards unified all-neural beamforming for time and frequency domain speech separation,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 31, pp. 849– 862, 2022

2022

[20] [20]

Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,

K. ˇZmol´ıkov´a, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Bur- get, and J. ˇCernock`y, “Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,”IEEE J. Sel. Topics Signal Process., vol. 13, no. 4, pp. 800–814, 2019

2019

[21] [21]

Spex: Multi-scale time domain speaker extraction network,

C. Xu, W. Rao, E. S. Chng, and H. Li, “Spex: Multi-scale time domain speaker extraction network,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 28, pp. 1370–1384, 2020

2020

[22] [22]

X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,

K. Liu, Z. Du, X. Wan, and H. Zhou, “X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,” in ICASSP. IEEE, 2023, pp. 1–5

2023

[23] [23]

Clipsep: Learning text-queried sound separation with noisy unlabeled videos,

H.-W. Dong, N. Takahashi, Y . Mitsufuji, J. McAuley, and T. Berg- Kirkpatrick, “Clipsep: Learning text-queried sound separation with noisy unlabeled videos,”arXiv preprint arXiv:2212.07065, 2022

work page arXiv 2022

[24] [24]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inICASSP. IEEE, 2023, pp. 1–5

2023

[25] [25]

Clapsep: Lever- aging contrastive pre-trained model for multi-modal query-conditioned target sound extraction,

H. Ma, Z. Peng, X. Li, M. Shao, X. Wu, and J. Liu, “Clapsep: Lever- aging contrastive pre-trained model for multi-modal query-conditioned target sound extraction,”IEEE Trans. on Audio, Speech, and Language Processing, 2024

2024

[26] [26]

Separate anything you describe,

X. Liu, Q. Kong, Y . Zhao, H. Liu, Y . Yuan, Y . Liu, R. Xia, Y . Wang, M. D. Plumbley, and W. Wang, “Separate anything you describe,”IEEE Trans. on Audio, Speech, and Language Processing, 2024

2024

[27] [27]

Zero-shot audio source separation through query-based learning from weakly-labeled data,

K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Zero-shot audio source separation through query-based learning from weakly-labeled data,” inAAAI, vol. 36, no. 4, 2022, pp. 4441–4449

2022

[28] [28]

Conv-TasNet: Surpassing ideal time– frequency magnitude masking for speech separation,

Y . Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time– frequency magnitude masking for speech separation,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019

2019

[29] [29]

Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,

Y . Luo, Z. Chen, and T. Yoshioka, “Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,” inICASSP. IEEE, 2020, pp. 46–50

2020

[30] [30]

Tf-gridnet: Integrating full-and sub-band modeling for speech separa- tion,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watanabe, “Tf-gridnet: Integrating full-and sub-band modeling for speech separa- tion,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 31, pp. 3221–3236, 2023

2023

[31] [31]

A survey of sound source localization with deep learning methods,

P.-A. Grumiaux, S. Kiti ´c, L. Girin, and A. Gu ´erin, “A survey of sound source localization with deep learning methods,”J. Acoust. Soc. Am., vol. 152, no. 1, pp. 107–151, 2022

2022

[32] [32]

Learning to localize sound sources in visual scenes: Analysis and applications,

A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon, “Learning to localize sound sources in visual scenes: Analysis and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1605–1619, 2019

2019

[33] [33]

Robust audio-visual contrastive learning for proposal-based self-supervised sound source localization in videos,

H. Xuan, Z. Wu, J. Yang, B. Jiang, L. Luo, X. Alameda-Pineda, and Y . Yan, “Robust audio-visual contrastive learning for proposal-based self-supervised sound source localization in videos,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 7, pp. 4896–4907, 2024

2024

[34] [34]

Enhancing sound source localization via false negative elimination,

Z. Song, J. Zhang, Y . Wang, J. Fan, and Z. Zhang, “Enhancing sound source localization via false negative elimination,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 10 499–10 514, 2024

2024

[35] [35]

Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,

T. N. T. Nguyen, K. N. Watcharasupat, N. K. Nguyen, D. L. Jones, and W.-S. Gan, “Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 30, pp. 1749–1762, 2022

2022

[36] [36]

Fn-ssl: Full-band and narrow-band fusion for sound source localization,

Y . Wang, B. Yang, and X. Li, “Fn-ssl: Full-band and narrow-band fusion for sound source localization,” inINTERSPEECH, 2023, pp. 3779–3783

2023

[37] [37]

Ipdnet: A universal direct-path ipd estimation network for sound source localization,

——, “Ipdnet: A universal direct-path ipd estimation network for sound source localization,”IEEE Trans. on Audio, Speech, and Language Processing, 2024

2024

[38] [38]

Salad- net: Self-attentive multisource localization in the ambisonics domain,

P.-A. Grumiaux, S. Kiti ´c, P. Srivastava, L. Girin, and A. Gu´erin, “Salad- net: Self-attentive multisource localization in the ambisonics domain,” inIEEE Workshop Appl. Signal Process. Audio Acoust.IEEE, 2021, pp. 336–340

2021

[39] [39]

Swg-former: A sliding- window graph convolutional network for simultaneous spatial-temporal information extraction in sound event localization and detection,

W. Huang, Q. Huang, L. Ma, and C. Wang, “Swg-former: A sliding- window graph convolutional network for simultaneous spatial-temporal information extraction in sound event localization and detection,”arXiv preprint arXiv:2310.14016, 2023

work page arXiv 2023

[40] [40]

Sound event localization and detection for real spatial sound scenes: Event-independent network and data augmentation chains,

J. Hu, Y . Cao, M. Wu, Q. Kong, F. Yang, M. D. Plumbley, and J. Yang, “Sound event localization and detection for real spatial sound scenes: Event-independent network and data augmentation chains,” arXiv preprint arXiv:2209.01802, 2022

work page arXiv 2022

[41] [41]

Cst-former: Transformer with channel-spectro- temporal attention for sound event localization and detection,

Y . Shul and J.-W. Choi, “Cst-former: Transformer with channel-spectro- temporal attention for sound event localization and detection,” in ICASSP. IEEE, 2024, pp. 8686–8690

2024

[42] [42]

Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,

T. N. T. Nguyen, D. L. Jones, K. N. Watcharasupat, H. Phan, and W.-S. Gan, “Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” inICASSP. IEEE, 2022, pp. 716–720

2022

[43] [43]

The lu system for dcase 2024 sound event localization and detection challenge,

A. Berg, J. Engman, J. Gulin, K. Astr ¨om, M. Oskarsson, and B. Sony Eu- rope, “The lu system for dcase 2024 sound event localization and detection challenge,” DCASE2024 Challenge, Tech. Rep, Tech. Rep., 2024

2024

[44] [44]

Learning multi-target tdoa features for sound event localization and detection,

A. Berg, J. Engman, J. Gulin, K. ˚Astr¨om, and M. Oskarsson, “Learning multi-target tdoa features for sound event localization and detection,” arXiv preprint arXiv:2408.17166, 2024

work page arXiv 2024

[45] [45]

Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization,

B. Yang, H. Liu, and X. Li, “Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization,” inICASSP. IEEE, 2022, pp. 721–725

2022

[46] [46]

Mimo-doanet: Multi-channel input and multiple outputs doa 15 network with unknown number of sound sources,

H. Yin, M. Ge, Y . Fu, G. Zhang, L. Wang, L. Zhang, L. Qiu, and J. Dang, “Mimo-doanet: Multi-channel input and multiple outputs doa 15 network with unknown number of sound sources,”arXiv preprint arXiv:2207.07307, 2022

work page arXiv 2022

[47] [47]

Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection,

K. Shimada, Y . Koyama, N. Takahashi, S. Takahashi, and Y . Mitsufuji, “Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection,” inICASSP. IEEE, 2021, pp. 915–919

2021

[48] [48]

Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,

K. Shimada, Y . Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, and Y . Mitsufuji, “Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” inICASSP. IEEE, 2022, pp. 316–320

2022

[49] [49]

Event-independent network for polyphonic sound event localization and detection,

Y . Cao, T. Iqbal, Q. Kong, Y . Zhong, W. Wang, and M. D. Plumbley, “Event-independent network for polyphonic sound event localization and detection,”arXiv preprint arXiv:2010.00140, 2020

work page arXiv 2010

[50] [50]

An improved event-independent network for polyphonic sound event localization and detection,

Y . Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley, “An improved event-independent network for polyphonic sound event localization and detection,” inICASSP. IEEE, 2021, pp. 885–889

2021

[51] [51]

Mff-einv2: Multi-scale feature fusion across spectral-spatial-temporal domains for sound event localization and detection,

D. Mu, Z. Zhang, and H. Yue, “Mff-einv2: Multi-scale feature fusion across spectral-spatial-temporal domains for sound event localization and detection,”arXiv preprint arXiv:2406.08771, 2024

work page arXiv 2024

[52] [52]

Stereo sound event localization and detection with onscreen/offscreen classification,

K. Shimada, A. Politis, I. R. Roman, P. Sudarsanam, D. Diaz-Guerra, R. Pandey, K. Uchida, Y . Koyama, N. Takahashi, T. Shibuyaet al., “Stereo sound event localization and detection with onscreen/offscreen classification,”arXiv preprint arXiv:2507.12042, 2025

work page arXiv 2025

[53] [53]

Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

S. Adavanne, A. Politis, and T. Virtanen, “Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network,”arXiv preprint arXiv:1904.12769, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[54] [54]

Differentiable tracking-based training of deep learning sound source localizers,

——, “Differentiable tracking-based training of deep learning sound source localizers,” inIEEE Workshop Appl. Signal Process. Audio Acoust.IEEE, 2021, pp. 211–215

2021

[55] [55]

Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment,

S. Sivasankaran, E. Vincent, and D. Fohr, “Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment,” inINTERSPEECH, 2018

2018

[56] [56]

Locate this, not that: Class-conditioned sound event doa estimation,

O. Slizovskaia, G. Wichern, Z.-Q. Wang, and J. Le Roux, “Locate this, not that: Class-conditioned sound event doa estimation,” inICASSP. IEEE, 2022, pp. 711–715

2022

[57] [57]

Text-queried target sound event localization,

J. Zhao, X. Qian, Y . Xu, H. Liu, Y . Cao, D. Berghi, and W. Wang, “Text-queried target sound event localization,” inEUSIPCO. IEEE, 2024, pp. 261–265

2024

[58] [58]

Locselect: Target speaker localization with an auditory selective hearing mechanism,

Y . Chen, X. Qian, Z. Pan, K. Chen, and H. Li, “Locselect: Target speaker localization with an auditory selective hearing mechanism,” inICASSP. IEEE, 2024, pp. 8696–8700

2024

[59] [59]

Gcc-speaker: Target speaker localization with optimal speaker-dependent weighting in multi-speaker scenarios,

G. Li, W. Xue, W. Liu, J. Yi, and J. Tao, “Gcc-speaker: Target speaker localization with optimal speaker-dependent weighting in multi-speaker scenarios,” inICASSP. IEEE, 2023, pp. 1–5

2023

[60] [60]

Librispeech: an asr corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” inICASSP. IEEE, 2015, pp. 5206–5210

2015

[61] [61]

A holistic evaluation of piano sound quality,

M. Zhou, S. Wu, S. Ji, Z. Li, and W. Li, “A holistic evaluation of piano sound quality,” inNational Conf. on Sound and Music Technology. Springer, 2023, pp. 3–17

2023

[62] [62]

Guitarset: A dataset for guitar transcription

Q. Xi, R. M. Bittner, J. Pauwels, X. Ye, and J. P. Bello, “Guitarset: A dataset for guitar transcription.” inISMIR, 2018, pp. 453–460

2018

[63] [63]

Audio set: An ontology and human-labeled dataset for audio events,

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” inICASSP. IEEE, 2017, pp. 776–780

2017

[64] [64]

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,

X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,”IEEE Trans. on Audio, Speech, and Language Processing, vol. 32, pp. 3339– 3354, 2024

2024

[65] [65]

The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,

C. K. Reddy, V . Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braunet al., “The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,”arXiv preprint arXiv:2005.13981, 2020

work page arXiv 2020

[66] [66]

WHAM!: Extending Speech Separation to Noisy Environments

G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “Wham!: Extending speech separation to noisy environments,”arXiv preprint arXiv:1907.01160, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[67] [67]

ESC: Dataset for Environmental Sound Classification,

K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in ACM Int. Conf. Multimedia. ACM Press, pp. 1015–1018

[68] [68]

A dataset and taxonomy for urban sound research,

J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” inACM Int. Conf. Multimedia, Orlando, FL, USA, Nov. 2014, pp. 1041–1044

2014

[69] [69]

The qut-noise- timit corpus for evaluation of voice activity detection algorithms,

D. Dean, S. Sridharan, R. V ogt, and M. Mason, “The qut-noise- timit corpus for evaluation of voice activity detection algorithms,” in INTERSPEECH. International Speech Communication Association, 2010, pp. 3110–3113

2010

[70] [70]

MUSAN: A Music, Speech, and Noise Corpus

D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,”arXiv preprint arXiv:1510.08484, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[71] [71]

gpurir: A python library for room impulse response simulation with gpu acceleration,

D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpurir: A python library for room impulse response simulation with gpu acceleration,” Multimedia Tools and Applications, vol. 80, no. 4, pp. 5653–5671, 2021

2021

[72] [72]

A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,

A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,”arXiv preprint arXiv:2006.01919, 2020

work page arXiv 2006

[73] [73]

TAU Spatial Room Impulse Response Database (TAU-SRIR DB),

A. Politis, S. Adavanne, T. Virtanen, E. Fagerlund, A. Koskimies, A. Hakala, and A. Gohar, “TAU Spatial Room Impulse Response Database (TAU-SRIR DB),” Apr. 2022

2022

[74] [74]

Zero-and few-shot sound event localization and detection,

K. Shimada, K. Uchida, Y . Koyama, T. Shibuya, S. Takahashi, Y . Mit- sufuji, and T. Kawahara, “Zero-and few-shot sound event localization and detection,” inICASSP. IEEE, 2024, pp. 636–640

2024

[75] [75]

Two vs. four-channel sound event localization and detection,

J. Wilkins, M. Fuentes, L. Bondi, S. Ghaffarzadegan, A. Abavisani, and J. P. Bello, “Two vs. four-channel sound event localization and detection,”arXiv preprint arXiv:2309.13343, 2023

work page arXiv 2023

[76] [76]

Data augmentation and class-based ensembled cnn-conformer networks for sound event localization and detection,

Y . Zhang, S. Wang, Z. Li, K. Guo, S. Chen, and Y . Pang, “Data augmentation and class-based ensembled cnn-conformer networks for sound event localization and detection,”Proc. DCASE, vol. 2021, 2021

2021

[77] [77]

Starss22: A dataset of spatial recordings of real scenes with spatiotem- poral annotations of sound events,

A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y . Koyama, N. Takahashi, S. Takahashi, Y . Mitsufuji, and T. Virtanen, “Starss22: A dataset of spatial recordings of real scenes with spatiotem- poral annotations of sound events,”arXiv preprint arXiv:2206.01948, 2022

work page arXiv 2022

[78] [78]

A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms,

´A. F. Garc´ıa-Fern´andez, A. S. Rahmathullah, and L. Svensson, “A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms,”IEEE Trans. on Signal Processing, vol. 68, pp. 3917–3928, 2020

2020

[79] [79]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W....

work page internal anchor Pith review Pith/arXiv arXiv 2024