NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Junqiao Fan; Lihua Xie; Shenghai Yuan; Yizhuo Yang

arxiv: 2606.18664 · v1 · pith:HPXZJ6DYnew · submitted 2026-06-17 · 💻 cs.SD · cs.AI

Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie

Pith reviewed 2026-06-26 19:59 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords sound source localization · robot audition · MUSIC algorithm · covariance matrix estimation · direction of arrival · neural network · self-supervised learning · hybrid method

The pith

A neural network estimates the spatial covariance matrix to feed into the classical MUSIC algorithm for robot sound source localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a neural network can predict the spatial covariance matrix from multichannel microphone signals, allowing the predicted matrix to be substituted directly into the MUSIC pipeline for direction-of-arrival estimation. This hybrid approach includes eigenvalue decomposition, pseudo-spectrum computation, a Frequency Attention Fusion module, and a self-supervised strategy that trains on unlabeled acoustic data to learn spatial correlations. A sympathetic reader would care because classical MUSIC degrades at low signal-to-noise ratios while pure deep-learning methods often fail to generalize across robotic conditions; a method that combines the two could deliver reliable spatial perception for autonomous robots operating in noisy, changing environments.

Core claim

NeuralMUSIC first trains a neural network to estimate the spatial covariance matrix from raw multichannel observations and substitutes the estimate into the standard MUSIC procedure of eigenvalue decomposition and pseudo-spectrum calculation. A Frequency Attention Fusion module then combines information across frequencies, and Self-supervised Spatial Correlation Learning augments training with unlabeled data. Experiments across robotic tasks show that the resulting system reaches competitive localization accuracy together with gains in robustness and cross-domain generalization.
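To make the substitution step concrete, here is a minimal sketch of the classical half of the pipeline: MUSIC applied to whatever covariance matrix it is handed. It is not the paper's implementation. The uniform linear array, the half-wavelength spacing, the single narrowband frequency, and the names `steering_vector` and `music_spectrum` are all assumptions made for illustration.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    k = np.arange(n_mics)
    phase = 2j * np.pi * spacing_wavelengths * k * np.sin(np.deg2rad(theta_deg))
    return np.exp(phase)

def music_spectrum(R, n_sources, grid_deg, spacing_wavelengths=0.5):
    """MUSIC pseudo-spectrum for a Hermitian covariance matrix R.

    R may come from any estimator (sample average or a neural network);
    this function only performs the EVD and the noise-subspace projection.
    """
    n_mics = R.shape[0]
    # eigh sorts eigenvalues in ascending order, so the first
    # (n_mics - n_sources) eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(R)
    noise_subspace = eigvecs[:, : n_mics - n_sources]
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        a = steering_vector(theta, n_mics, spacing_wavelengths)
        proj = noise_subspace.conj().T @ a
        # Guard against division by zero when a lies exactly in the signal subspace.
        spectrum[i] = 1.0 / max(np.real(np.vdot(proj, proj)), 1e-12)
    return spectrum

# Toy check with an exact covariance: one source at 20 degrees plus white noise.
n_mics, true_doa = 8, 20.0
a = steering_vector(true_doa, n_mics)
R = np.outer(a, a.conj()) + 0.1 * np.eye(n_mics)
grid = np.arange(-90.0, 90.5, 0.5)
estimate = grid[np.argmax(music_spectrum(R, 1, grid))]
```

In the hybrid scheme described above, `R` would be the network's prediction rather than this hand-built matrix; the rest of the function would stay unchanged, which is what lets the covariance estimator be swapped in isolation.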

What carries the argument

NeuralMUSIC hybrid framework: neural estimation of the spatial covariance matrix inserted into the MUSIC subspace pipeline, augmented by Frequency Attention Fusion and Self-supervised Spatial Correlation Learning.

If this is right

Robots obtain usable direction-of-arrival estimates even when microphone signals are noisy enough to break classical covariance estimation.
The same model can be deployed across different robotic platforms and acoustic environments without retraining from scratch.
Unlabeled recordings collected during normal robot operation become a training resource that improves spatial correlation learning.
The hybrid pipeline retains the theoretical interpretability of eigenvalue-based subspace methods while gaining data-driven robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the covariance estimate is the main performance bottleneck, similar neural substitution could be tested on other subspace techniques such as ESPRIT.
The self-supervised spatial correlation objective might transfer to tasks like acoustic scene analysis or multi-robot coordination where labeled direction data are scarce.
In very low-data regimes the method could be combined with physics-informed constraints on the covariance structure to further reduce reliance on labeled examples.

Load-bearing premise

The neural network must produce a covariance-matrix estimate accurate enough that feeding it into the MUSIC pipeline yields performance at least as good as using the true covariance under the operating conditions of interest.

What would settle it

Measure, on held-out robotic recordings at varying signal-to-noise ratios, whether the localization error of NeuralMUSIC is no higher than that of classical MUSIC supplied with the covariance matrix computed directly from the same data. The claim fails if NeuralMUSIC's error is consistently higher.
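A comparison of this kind needs an error metric that respects the wrap-around of azimuth angles. The sketch below shows one common choice, the absolute angular error wrapped into [0, 180] degrees and averaged per SNR level. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
import numpy as np

def angular_error_deg(pred_deg, true_deg):
    """Absolute azimuth error in degrees, wrapped into [0, 180]."""
    pred = np.asarray(pred_deg, dtype=float)
    true = np.asarray(true_deg, dtype=float)
    diff = (pred - true + 180.0) % 360.0 - 180.0
    return np.abs(diff)

def mean_error_by_snr(pred_deg, true_deg, snr_db):
    """Mean wrapped angular error for each distinct SNR label."""
    errors = angular_error_deg(pred_deg, true_deg)
    snr_db = np.asarray(snr_db)
    return {float(s): float(errors[snr_db == s].mean()) for s in np.unique(snr_db)}

# Made-up predictions and labels, purely to show the call pattern.
report = mean_error_by_snr(
    pred_deg=[10.0, 20.0, 359.0],
    true_deg=[12.0, 20.0, 1.0],
    snr_db=[0, 0, 10],
)
```

Running the same function on the outputs of both methods, recording by recording, would give the per-SNR comparison the test calls for.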

Figures

Figures reproduced from arXiv: 2606.18664 by Junqiao Fan, Lihua Xie, Shenghai Yuan, Yizhuo Yang.

**Figure 2.** Overall architecture of the proposed hybrid neural–subspace framework for robot audition.

**Figure 3.** (a) ReSpeaker microphone array used in the GSC ex

**Figure 4.** Qualitative comparison under different source configurations on AV16.3 dataset.

**Figure 5.** Performance versus different training sample ratios.

**Figure 7.** Illustration of the three rooms and microphone array

**Figure 8.** Masking strategy used in the proposed SSCL module

**Figure 9.** The predicted spectrum for different methods under

**Figure 11.** It can be observed that models initialized with SSCL

**Figure 10.** Predicted DOA spectra of different methods under varying numbers of sound sources.

**Figure 11.** Training loss curve under different training data ratio with (w/) and without (w/o) SSCL

Original abstract

Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
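The abstract does not specify how the FAF module combines frequencies, so the following is only a stand-in that illustrates the general idea of weighted fusion across frequency bins. It normalises each per-frequency pseudo-spectrum and averages them with softmax weights. In a learned module the scores would come from a network; here `scores` is a plain input, and `fuse_spectra` is a hypothetical name.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def fuse_spectra(spectra, scores):
    """Fuse per-frequency pseudo-spectra into one spectrum over angles.

    spectra: array of shape (n_freqs, n_angles), one pseudo-spectrum per bin.
    scores:  array of shape (n_freqs,), unnormalised relevance per bin
             (an assumption: a learned module would predict these).
    """
    spectra = np.asarray(spectra, dtype=float)
    # Scale each bin to a peak of 1 so no bin dominates purely by magnitude.
    normalised = spectra / spectra.max(axis=1, keepdims=True)
    weights = softmax(np.asarray(scores, dtype=float))
    return weights @ normalised

# Two frequency bins that disagree on the peak location.
spectra = [[1.0, 4.0, 1.0],
           [2.0, 2.0, 8.0]]
uniform = fuse_spectra(spectra, scores=[0.0, 0.0])
favour_first = fuse_spectra(spectra, scores=[10.0, 0.0])
```

With equal scores the function reduces to a plain average of the normalised spectra, which corresponds to classical incoherent fusion; unequal scores let reliable bins outvote noisy ones.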

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NeuralMUSIC plugs a neural covariance estimator into the MUSIC pipeline plus frequency attention and self-supervised pretraining, but the abstract supplies no numbers or checks that the estimated matrix actually preserves the subspace properties MUSIC relies on.

The main thing here is a hybrid setup: a neural net estimates the spatial covariance from microphone-array signals, that matrix goes into classical MUSIC for EVD and pseudo-spectrum computation, and a Frequency Attention Fusion module then produces the final DOA estimates. A Self-supervised Spatial Correlation Learning strategy uses unlabeled data to improve data efficiency. The named pieces are new as a package for robot audition.

The framing is reasonable. Classical MUSIC is known to fall apart at low SNR, and end-to-end networks often fail to generalize across rooms or robot platforms, so trying to keep the subspace method while learning a better covariance input is a logical engineering move. The self-supervised angle also makes sense when labeled DOA data is expensive.

The soft spot is exactly the one the stress-test note flags. The abstract asserts competitive accuracy, robustness, and cross-domain gains, yet gives zero quantitative results, no baselines, no ablation that swaps the neural covariance for the sample covariance, and no check on whether the estimated matrix keeps the signal-noise orthogonality (subspace angles, covariance error norms, or low-SNR behavior). Without those, it is impossible to tell whether the hybrid actually helps or whether any reported lift comes from the attention module or dataset tuning. If the full paper contains those checks and they hold, the claim strengthens; on the abstract alone the central substitution remains unverified.
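The ablation mentioned here needs the classical baseline estimate, which is simply the average outer product of the array snapshots. The sketch below shows that estimator and, on synthetic white noise, how its error depends on the number of snapshots. The helper names and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def sample_covariance(X):
    """Sample spatial covariance of snapshots X with shape (n_mics, n_snapshots)."""
    X = np.asarray(X)
    return X @ X.conj().T / X.shape[1]

def white_noise_snapshots(n_mics, n_snapshots, rng):
    """Unit-variance circular complex Gaussian noise (synthetic, for illustration)."""
    shape = (n_mics, n_snapshots)
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2.0)

# The true covariance of this noise is the identity, so the Frobenius distance
# to the identity measures the estimation error directly.
rng = np.random.default_rng(0)
err_few = np.linalg.norm(sample_covariance(white_noise_snapshots(4, 10, rng)) - np.eye(4))
err_many = np.linalg.norm(sample_covariance(white_noise_snapshots(4, 10000, rng)) - np.eye(4))
```

Feeding `sample_covariance` output and the network's prediction into the same MUSIC routine, on the same recordings, would isolate how much of any reported gain comes from the covariance estimate itself.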

This is for people working on deployed robot audition systems who already use subspace methods and want a drop-in neural upgrade. A reader who needs a concrete recipe with reproducible numbers will get limited value until the experiments are shown.

Send it to peer review if the full manuscript includes the missing ablations and direct comparisons; otherwise it stays too preliminary for referee time.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. A neural network estimates the spatial covariance matrix from multichannel microphone observations; this estimate is substituted into the classical MUSIC pipeline (eigenvalue decomposition followed by pseudo-spectrum computation). A Frequency Attention Fusion (FAF) module produces the final DOA estimates, and a Self-supervised Spatial Correlation Learning (SSCL) strategy is introduced to leverage unlabeled data. The central claim is that the approach achieves competitive localization accuracy with improved robustness and cross-domain generalization across robotic tasks.

Significance. If the hybrid substitution is shown to preserve MUSIC subspace properties while adding robustness, the work would provide a concrete example of combining the theoretical grounding of classical subspace methods with the adaptability of neural networks for robot audition. The SSCL component is a positive element for data efficiency in practical settings.

major comments (2)

[Abstract] Abstract: the performance claims (competitive accuracy, improved robustness, cross-domain generalization) are stated without any quantitative results, baselines, error metrics, dataset descriptions, or ablation studies, so the central claim cannot be evaluated.
[Neural covariance estimation and MUSIC pipeline] Method (neural covariance integration into MUSIC): no metrics are supplied (e.g., covariance estimation error norms, principal angles between estimated and true signal/noise subspaces, or ablation replacing the NN output with sample covariance) demonstrating that the neural estimate preserves the signal-noise orthogonality required for accurate pseudo-spectrum peaks, especially at low SNR. This is load-bearing for the claim that the hybrid pipeline outperforms or matches classical MUSIC.
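The two diagnostics named in the second comment can be computed in a few lines. The following is a sketch under stated assumptions: a known number of sources, a reference covariance to compare against, and hypothetical function names. The example matrices are hand-built and do not come from the paper.

```python
import numpy as np

def relative_cov_error(R_est, R_ref):
    """Frobenius-norm error of R_est relative to the norm of R_ref."""
    return np.linalg.norm(R_est - R_ref, "fro") / np.linalg.norm(R_ref, "fro")

def noise_subspace(R, n_sources):
    """Orthonormal basis of the noise subspace: eigenvectors of the smallest eigenvalues."""
    _, vecs = np.linalg.eigh(R)  # eigenvalues in ascending order
    return vecs[:, : R.shape[0] - n_sources]

def principal_angles_deg(U, V):
    """Principal angles (degrees) between the column spaces of U and V.

    Both inputs are assumed to have orthonormal columns, so the singular
    values of U^H V are the cosines of the principal angles.
    """
    s = np.linalg.svd(U.conj().T @ V, compute_uv=False)
    return np.rad2deg(np.arccos(np.clip(s, 0.0, 1.0)))

# Hand-built example: one source on a 4-element half-wavelength array.
a = np.exp(1j * np.pi * np.arange(4) * np.sin(np.deg2rad(20.0)))
R_ref = np.outer(a, a.conj()) + 0.1 * np.eye(4)
R_est = R_ref + 0.05 * np.diag(np.arange(4.0))  # a fixed, arbitrary perturbation
err = relative_cov_error(R_est, R_ref)
angles = principal_angles_deg(noise_subspace(R_est, 1), noise_subspace(R_ref, 1))
```

The two numbers answer different questions. A rescaled covariance has a large norm error but an unchanged noise subspace, so the principal angles are the more direct check on whether MUSIC's orthogonality assumption survives the substitution.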

minor comments (1)

[Method] The description of the FAF module and SSCL loss could include pseudocode or explicit equations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of results and validation of the hybrid approach.

Referee: [Abstract] Abstract: the performance claims (competitive accuracy, improved robustness, cross-domain generalization) are stated without any quantitative results, baselines, error metrics, dataset descriptions, or ablation studies, so the central claim cannot be evaluated.

Authors: We agree that the abstract would benefit from including key quantitative highlights. In the revised version, we will update the abstract to report specific metrics such as mean angular error on the robotic datasets, comparisons to classical MUSIC and deep learning baselines, and brief dataset references. This will make the central claims directly evaluable. Revision: yes.
Referee: [Neural covariance estimation and MUSIC pipeline] Method (neural covariance integration into MUSIC): no metrics are supplied (e.g., covariance estimation error norms, principal angles between estimated and true signal/noise subspaces, or ablation replacing the NN output with sample covariance) demonstrating that the neural estimate preserves the signal-noise orthogonality required for accurate pseudo-spectrum peaks, especially at low SNR. This is load-bearing for the claim that the hybrid pipeline outperforms or matches classical MUSIC.

Authors: We acknowledge that explicit metrics on the neural covariance estimate are needed to directly support preservation of MUSIC subspace properties. While overall localization results provide supporting evidence, we agree this validation is important. In the revision, we will add covariance estimation error norms, principal angles between estimated and true subspaces at varying SNRs, and an ablation replacing the neural output with sample covariance within the MUSIC pipeline. Revision: yes.

Circularity Check

0 steps flagged

No circularity: hybrid pipeline is feed-forward with independent classical MUSIC step

The paper describes a standard hybrid architecture in which a neural network produces a covariance estimate that is then substituted into the unmodified MUSIC pipeline (EVD + pseudo-spectrum). No equation or claim reduces the final DOA output to a quantity defined in terms of itself, nor does any 'prediction' consist of a fitted parameter renamed as output. SSCL is a self-supervised pre-training step on unlabeled data and does not create a definitional loop with the localization result. The central claim therefore rests on empirical performance rather than on any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Information is limited to the abstract; the ledger therefore records only the most obvious domain assumptions required for the described pipeline to be well-defined.

axioms (1)

Domain assumption: The spatial covariance matrix estimated by the neural network can be substituted into the eigenvalue decomposition step of MUSIC without invalidating the underlying subspace separation assumptions.
The hybrid claim rests on this substitution being valid under the noise and array conditions encountered in the robotic tasks.
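One necessary condition for this assumption is that the matrix handed to the EVD is Hermitian and positive semidefinite, which an unconstrained network output need not be. The sketch below shows a generic projection that enforces both properties. Whether the paper applies any such step is not stated in the abstract, and `project_to_hermitian_psd` is a hypothetical helper.

```python
import numpy as np

def project_to_hermitian_psd(M, floor=0.0):
    """Return a Hermitian PSD matrix derived from an arbitrary square matrix M.

    Step 1: keep only the Hermitian part of M.
    Step 2: clip eigenvalues below `floor`, which removes negative energy.
    """
    H = 0.5 * (M + M.conj().T)
    vals, vecs = np.linalg.eigh(H)
    vals = np.clip(vals, floor, None)
    # Rebuild V diag(vals) V^H; broadcasting scales each eigenvector column.
    return (vecs * vals) @ vecs.conj().T

# An arbitrary non-Hermitian, indefinite matrix standing in for a raw network output.
raw = np.array([[1.0, 2.0 + 1.0j],
                [0.0, -3.0]])
R_valid = project_to_hermitian_psd(raw)
```

An alternative that avoids the projection is to have the network predict a factor `L` and form `L @ L.conj().T`, which is Hermitian PSD by construction. Either way, validity of the matrix is only a precondition; it does not by itself guarantee that the signal and noise subspaces are well separated.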
