ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D
Pith reviewed 2026-05-07 14:24 UTC · model grok-4.3
The pith
ASAP first locks azimuth within azimuthal strips, then refines elevation along great-circle arcs, reducing the number of SRP-PHAT evaluations needed for planar arrays.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps in the first stage; it then refines elevation along the great-circle arc between two close candidates in the second stage.
What carries the argument
Azimuth-priority strip contraction that retains multiple maxima inside spherical caps, followed by elevation refinement along a great-circle arc.
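The two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy score function, the strip count, the elevation samples, the number of retained strips, and the choice of the two arc endpoints are all assumptions made here for illustration, and names such as `lock_azimuth` and `refine_on_arc` are hypothetical.

```python
import numpy as np

def unit(az_deg, el_deg):
    """Unit vector for azimuth/elevation in degrees (elevation from the array plane)."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def make_toy_score(az0, el0):
    """Hypothetical stand-in for SRP-PHAT on a planar array: it only sees the
    in-plane projection of the direction and peaks at (az0, el0)."""
    target = unit(az0, el0)[:2]
    return lambda az, el: -float(np.sum((unit(az, el)[:2] - target) ** 2))

def lock_azimuth(score, n_strips=12, el_samples=(15.0, 45.0, 75.0), keep=2, levels=6):
    """Stage 1 (assumed form): score each azimuthal strip at its centre, keep the
    best `keep` strips, halve them, and repeat. Returns (azimuth, evaluations)."""
    width = 360.0 / n_strips
    strips = [(k * width, (k + 1) * width) for k in range(n_strips)]
    n_evals = 0
    for level in range(levels):
        ranked = []
        for lo, hi in strips:
            az = 0.5 * (lo + hi)
            n_evals += len(el_samples)
            ranked.append((max(score(az, el) for el in el_samples), lo, hi))
        ranked.sort(reverse=True)
        kept = ranked[:keep]
        if level < levels - 1:
            # Contract: split every retained strip into two half-width strips.
            strips = [h for _, lo, hi in kept
                      for h in ((lo, 0.5 * (lo + hi)), (0.5 * (lo + hi), hi))]
    _, lo, hi = kept[0]
    return 0.5 * (lo + hi), n_evals

def slerp(a, b, t):
    """Point at fraction t along the great-circle arc from unit vector a to b."""
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if omega < 1e-9:
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def to_angles(u):
    """Convert a unit vector back to (azimuth, elevation) in degrees."""
    az = float(np.degrees(np.arctan2(u[1], u[0])) % 360.0)
    el = float(np.degrees(np.arcsin(np.clip(u[2], -1.0, 1.0))))
    return az, el

def refine_on_arc(score, cand_a, cand_b, n_points=31):
    """Stage 2 (assumed form): sample the arc between two candidates, keep the best."""
    a, b = unit(*cand_a), unit(*cand_b)
    pts = [to_angles(slerp(a, b, t)) for t in np.linspace(0.0, 1.0, n_points)]
    return max(pts, key=lambda p: score(*p))

def two_stage_search(score, el_samples=(15.0, 45.0, 75.0)):
    az_hat, _ = lock_azimuth(score, el_samples=el_samples)
    # Assumption: the two arc endpoints are the two best elevation samples
    # at the locked azimuth; the paper may pick its candidates differently.
    el_a, el_b = sorted(el_samples, key=lambda el: score(az_hat, el), reverse=True)[:2]
    return refine_on_arc(score, (az_hat, el_a), (az_hat, el_b))
```

With the default settings, calling `two_stage_search(make_toy_score(40.0, 30.0))` uses about 130 score evaluations in total. The toy score is smooth and single-peaked on the upper hemisphere, so it says nothing about how the scheme behaves on a real SRP-PHAT surface with side lobes.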
Load-bearing premise
Azimuth estimation remains reliably more accurate than elevation estimation for most planar microphone arrays.
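A short far-field derivation suggests why this premise is plausible. It is a sketch under assumptions not stated on this page: all microphones lie in the plane z = 0, elevation θ is measured from the array plane, φ is azimuth, c is the speed of sound, and sign conventions for the delay may differ from the paper's.

```latex
\begin{align}
\mathbf{u}(\phi,\theta) &= [\cos\theta\cos\phi,\ \cos\theta\sin\phi,\ \sin\theta]^{\mathsf T},
\qquad \mathbf{d}_{ij} = \mathbf{p}_i - \mathbf{p}_j = [d_x,\ d_y,\ 0]^{\mathsf T} \\
\tau_{ij}(\phi,\theta) &= \frac{\mathbf{d}_{ij}^{\mathsf T}\mathbf{u}(\phi,\theta)}{c}
 = \frac{\cos\theta}{c}\,\bigl(d_x\cos\phi + d_y\sin\phi\bigr) \\
\frac{\partial\tau_{ij}}{\partial\phi} &= \frac{\cos\theta}{c}\,\bigl(-d_x\sin\phi + d_y\cos\phi\bigr),
\qquad
\frac{\partial\tau_{ij}}{\partial\theta} = -\frac{\sin\theta}{c}\,\bigl(d_x\cos\phi + d_y\sin\phi\bigr)
\end{align}
```

Because every pairwise delay depends on elevation only through cos θ, the delays are identical for θ and −θ, and their sensitivity to elevation scales with sin θ, which vanishes for sources near the array plane. Sensitivity to azimuth scales with cos θ and is largest there. Near the zenith the situation reverses, so under these assumptions the premise should hold best at low to moderate elevations.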
What would settle it
A controlled test in which the true source direction lies outside all retained spherical caps after the first-stage azimuth contraction, causing the second stage to miss the correct elevation.
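The check at the heart of such a test is whether the true direction lies inside any retained cap. A minimal Python sketch follows; the function names and the representation of a cap as a centre direction plus an angular radius are assumptions made here, since the page does not say how the paper parameterises its caps.

```python
import numpy as np

def unit_vector(az_deg, el_deg):
    """Unit vector for azimuth/elevation given in degrees."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def angular_distance_deg(a, b):
    """Great-circle angle in degrees between two (az, el) directions."""
    cos_angle = np.clip(np.dot(unit_vector(*a), unit_vector(*b)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def covered_by_caps(true_dir, cap_centers, cap_radius_deg):
    """True if the true direction falls inside at least one spherical cap."""
    return any(angular_distance_deg(true_dir, c) <= cap_radius_deg
               for c in cap_centers)
```

Counting how often `covered_by_caps` returns False over many trials with known source directions would give an empirical miss rate for the first stage.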
Original abstract
Direction-of-arrival (DOA) estimation is an important task in microphone array processing and many downstream applications. The steered response power with phase transform (SRP-PHAT) method has been widely adopted for DOA estimation in recent years. However, accurate SRP-PHAT estimation in 3D scenarios requires evaluating steering responses over thousands of candidate directions, severely limiting real-time performance on resource-constrained platforms. This challenge becomes even more critical for planar arrays, which are widely used in robotics due to their structural simplicity. Motivated by the fact that azimuth estimation is usually more reliable than elevation estimation for most arrays, we propose ASAP, an azimuth-priority strip-based search approach to planar microphone array DOA estimation in 3D. In the first stage, ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps. In the second stage, it refines elevation along the great-circle arc between two close candidates. Extensive simulations and real-world experiments validate the efficiency and merits of the proposed method over existing approaches.
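For context, the exhaustive search that the abstract describes as costly can be sketched in a few lines of Python. This is a generic far-field SRP-PHAT grid search for a single frame, not code from the paper. The function names, the speed-of-sound constant, and the array shapes are assumptions made here.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def directions(az_deg, el_deg):
    """Unit vectors on an azimuth x elevation grid, shape (A, E, 3)."""
    az = np.radians(az_deg)[:, None]
    el = np.radians(el_deg)[None, :]
    return np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el) * np.ones_like(az)], axis=-1)

def srp_phat_grid(X, mic_pos, freqs, az_deg, el_deg):
    """Exhaustive far-field SRP-PHAT map for one frame.

    X: (M, F) complex spectra, mic_pos: (M, 3) in metres, freqs: (F,) in Hz.
    Returns an (A, E) array with one steered-power value per grid direction.
    """
    D = directions(az_deg, el_deg)
    power = np.zeros(D.shape[:2])
    n_mics = X.shape[0]
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = X[i] * np.conj(X[j])
            cross = cross / (np.abs(cross) + 1e-12)   # PHAT: keep phase only
            tau = D @ (mic_pos[i] - mic_pos[j]) / C    # (A, E) pairwise delays
            steer = np.exp(-2j * np.pi * tau[..., None] * freqs)  # (A, E, F)
            power += np.real(steer @ cross)            # sum over frequency
    return power
```

Every grid direction costs one steering evaluation per microphone pair. A 1° grid over the upper hemisphere (360 azimuths by 91 elevations) already contains 32,760 directions, which is the scale of search that strip-based contraction aims to avoid.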
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ASAP, an azimuth-priority strip-based search algorithm for efficient 3D DOA estimation with planar microphone arrays via SRP-PHAT. Motivated by the observation that azimuth estimates are typically more reliable than elevation estimates, the method uses a two-stage procedure. The first stage performs coarse-to-fine region contraction inside azimuthal strips to lock azimuth while retaining multiple candidate maxima through spherical caps. The second stage refines elevation along the great-circle arc between two close candidates. The authors state that simulations and real-world experiments demonstrate improved efficiency and performance relative to existing approaches.
Significance. If the accuracy claims hold, the work offers a practical, low-complexity alternative to exhaustive 3D grid searches for SRP-PHAT on planar arrays, which are common in robotics and other resource-constrained settings. The strip-based contraction exploits a domain-specific reliability difference to reduce the number of steering-vector evaluations without requiring new hardware or array geometries.
major comments (2)
- Abstract: the claim that 'extensive simulations and real-world experiments validate the efficiency and merits' is unsupported by any quantitative metrics, error bars, RMSE tables, or runtime comparisons in the provided text. This omission leaves the central performance claim (efficiency without accuracy loss) without visible evidence.
- First-stage description (azimuthal-strip contraction): the procedure retains multiple maxima via spherical caps but supplies no analytic bound on the maximum elevation mismatch that can be tolerated before the true azimuth peak falls outside the contracted strip. For planar arrays the SRP-PHAT surface can exhibit elevation-dependent grating lobes or broadened main lobes; if the initial coarse strip misses the global maximum, the subsequent arc refinement cannot recover it.
minor comments (1)
- The motivation paragraph would benefit from a brief citation to prior literature quantifying the relative reliability of azimuth versus elevation for planar arrays.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We have addressed each point below and revised the manuscript to improve clarity and strengthen the supporting evidence.
- Referee: Abstract: the claim that 'extensive simulations and real-world experiments validate the efficiency and merits' is unsupported by any quantitative metrics, error bars, RMSE tables, or runtime comparisons in the provided text. This omission leaves the central performance claim (efficiency without accuracy loss) without visible evidence.
  Authors: We agree that the abstract would benefit from more specific quantitative support. Although the full manuscript presents detailed simulation and experimental results with RMSE comparisons and runtime measurements in Sections IV and V, these were not highlighted in the abstract. We have revised the abstract to reference the key quantitative outcomes from our evaluations, and we have added a concise performance summary table to the revised manuscript to make the efficiency and accuracy claims directly visible.
  Revision: yes
- Referee: First-stage description (azimuthal-strip contraction): the procedure retains multiple maxima via spherical caps but supplies no analytic bound on the maximum elevation mismatch that can be tolerated before the true azimuth peak falls outside the contracted strip. For planar arrays the SRP-PHAT surface can exhibit elevation-dependent grating lobes or broadened main lobes; if the initial coarse strip misses the global maximum, the subsequent arc refinement cannot recover it.
  Authors: We thank the referee for highlighting this important robustness consideration. The original manuscript provides empirical validation through simulations but does not include an analytic bound. In the revised version, we have added analysis in Section III that derives an approximate tolerance bound on elevation mismatch based on the expected main-lobe width of the SRP-PHAT spectrum for planar arrays. We have also included additional simulation results quantifying the capture rate of the true azimuth under varying elevation errors, confirming high reliability within practical operating ranges. We acknowledge that extreme grating-lobe scenarios could still cause the coarse stage to miss the global peak, and we have added a brief discussion of this limitation.
  Revision: yes
Circularity Check
No circularity: procedural algorithm with independent motivation
The paper describes a two-stage coarse-to-fine search heuristic for SRP-PHAT DOA estimation on planar arrays. The first stage contracts azimuthal strips while retaining maxima via spherical caps; the second refines elevation along great-circle arcs. This structure is presented as a direct algorithmic procedure motivated by the stated domain observation that azimuth is typically more reliable than elevation. No equations appear that define a quantity in terms of itself, no parameters are fitted to a data subset and then relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. Validation rests on external simulations and real-world experiments rather than internal reduction, rendering the derivation self-contained.