pith. machine review for the scientific record.

arxiv: 2604.25387 · v1 · submitted 2026-04-28 · 📡 eess.AS · cs.RO

Recognition: unknown

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

Pith reviewed 2026-05-07 14:24 UTC · model grok-4.3

classification 📡 eess.AS cs.RO
keywords DOA estimation · SRP-PHAT · planar microphone array · azimuth search · 3D direction finding · region contraction · real-time localization

The pith

ASAP locks azimuth first within strips, then refines elevation along arcs, reducing the number of SRP-PHAT evaluations needed for planar arrays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ASAP to address the computational bottleneck of full-grid SRP-PHAT search, which evaluates thousands of candidate directions to obtain accurate 3D DOA estimates with planar microphone arrays. It exploits the observation that azimuth estimates are typically more reliable than elevation estimates: the first stage performs coarse-to-fine contraction inside azimuthal strips, locking candidate azimuths while preserving multiple peaks inside spherical caps. The second stage then searches only along the great-circle arc connecting the two closest retained candidates to refine elevation. If effective, this cuts the number of steering-response calculations enough to support real-time 3D tracking on robots and other resource-limited hardware while keeping accuracy close to that of exhaustive search.

Core claim

ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps in the first stage; it then refines elevation along the great-circle arc between two close candidates in the second stage.
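
One way to picture the first-stage contraction is a generic coarse-to-fine search on a single azimuth interval. The sketch below is not the paper's algorithm: the function name `contract_interval`, the sample count, the shrink factor, and the cosine toy score are all assumptions made for illustration, and a real SRP-PHAT map would replace the toy score.

```python
import math

def contract_interval(score, lo, hi, n_samples=9, n_iters=6, shrink=0.5):
    """Generic coarse-to-fine contraction on the interval [lo, hi] (radians).

    Each round samples the interval uniformly, keeps the best-scoring sample,
    and shrinks the interval around it. Illustrative only: the parameters are
    hypothetical defaults, not values taken from the paper.
    """
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    evaluations = 0
    best = 0.5 * (lo + hi)
    for _ in range(n_iters):
        step = (hi - lo) / (n_samples - 1)
        candidates = [lo + k * step for k in range(n_samples)]
        best = max(candidates, key=score)      # one score evaluation per candidate
        evaluations += n_samples
        half_width = 0.5 * shrink * (hi - lo)  # new interval is `shrink` times as wide
        lo, hi = best - half_width, best + half_width
    return best, evaluations

# Toy score with a single peak at 0.7 rad, searched inside a 30-90 degree strip.
true_az = 0.7
best_az, n_eval = contract_interval(
    lambda az: math.cos(az - true_az), math.radians(30), math.radians(90)
)
```

With these defaults the search spends 9 × 6 = 54 score evaluations on the strip. The sketch tracks only one peak; retaining several maxima, as the core claim describes, would require keeping more than one candidate per round.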

What carries the argument

Azimuth-priority strip contraction that retains maxima inside spherical caps, followed by elevation refinement along a great-circle arc.
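
The arc-restricted second stage can be illustrated by sampling candidate directions along the great circle between two unit vectors using spherical linear interpolation. This is a minimal sketch under stated assumptions: the helper names `unit_vector` and `great_circle_arc`, the azimuth/elevation convention (elevation measured from the array plane), and the number of samples are hypothetical and not taken from the paper.

```python
import math

def unit_vector(az, el):
    """Unit direction for azimuth `az` and elevation `el` (radians, el from the xy-plane)."""
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def great_circle_arc(p, q, n_points):
    """Return `n_points` unit vectors evenly spaced along the arc from p to q."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)                      # angle between p and q
    if omega < 1e-12:
        return [p] * n_points                   # coincident candidates: nothing to search
    if math.pi - omega < 1e-9:
        raise ValueError("antipodal points do not define a unique arc")
    points = []
    for k in range(n_points):
        t = k / (n_points - 1)
        a = math.sin((1.0 - t) * omega) / math.sin(omega)
        b = math.sin(t * omega) / math.sin(omega)
        points.append(tuple(a * x + b * y for x, y in zip(p, q)))
    return points

# Two candidates sharing azimuth 40 degrees, at elevations 10 and 50 degrees.
p = unit_vector(math.radians(40), math.radians(10))
q = unit_vector(math.radians(40), math.radians(50))
arc = great_circle_arc(p, q, 5)
```

Because the two example candidates share an azimuth, the arc is a meridian and the middle sample sits at 30 degrees elevation. Evaluating the response only at these few samples, rather than on a 2D patch, is what keeps the second stage cheap.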

Load-bearing premise

Azimuth estimation remains reliably more accurate than elevation estimation for most planar microphone arrays.
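
A standard far-field model gives some intuition for this premise. The derivation below is a sketch, not the paper's analysis: it assumes an ideal planar array lying in the xy-plane with microphone positions $\mathbf{m}_i = (x_i, y_i, 0)$, azimuth $\theta$, elevation $\phi$ measured from the array plane, speed of sound $c$, and an arbitrary sign convention for the time difference of arrival $\tau_{ij}$.

```latex
% Far-field TDOA for a planar array in the xy-plane (assumed model, not from the paper).
% \Delta x_{ij} = x_i - x_j, \Delta y_{ij} = y_i - y_j.
\begin{align}
\mathbf{u}(\theta,\phi) &= [\cos\phi\cos\theta,\ \cos\phi\sin\theta,\ \sin\phi]^\top, \\
\tau_{ij}(\theta,\phi) &= \frac{(\mathbf{m}_i-\mathbf{m}_j)^\top \mathbf{u}(\theta,\phi)}{c}
  = \frac{\cos\phi}{c}\,\bigl(\Delta x_{ij}\cos\theta + \Delta y_{ij}\sin\theta\bigr), \\
\frac{\partial \tau_{ij}}{\partial \theta} &= \frac{\cos\phi}{c}\,\bigl(-\Delta x_{ij}\sin\theta + \Delta y_{ij}\cos\theta\bigr), \\
\frac{\partial \tau_{ij}}{\partial \phi} &= -\frac{\sin\phi}{c}\,\bigl(\Delta x_{ij}\cos\theta + \Delta y_{ij}\sin\theta\bigr).
\end{align}
```

Under this model the delays depend on elevation only through $\cos\phi$, so $\tau_{ij}(\theta,\phi) = \tau_{ij}(\theta,-\phi)$ and the elevation sensitivity vanishes as $\phi \to 0$, while the azimuth sensitivity scales with $\cos\phi$ and vanishes toward the array normal. This is consistent with the premise at low to moderate elevations and suggests it may weaken for sources close to the array normal.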

What would settle it

A controlled test in which the true source direction lies outside all retained spherical caps after the first-stage azimuth contraction, causing the second stage to miss the correct elevation.
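
The failure condition in such a test reduces to a cap-coverage check: is the true direction within the angular radius of any retained cap? The sketch below shows that check under stated assumptions. The function names, the cap radius, and the example cap centers are hypothetical; the paper's actual cap sizes are not given in the text reviewed here.

```python
import math

def angular_distance(az1, el1, az2, el2):
    """Great-circle angle (radians) between two azimuth/elevation directions."""
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against rounding error

def covered_by_caps(true_dir, cap_centers, cap_radius):
    """True if `true_dir` lies inside at least one spherical cap of radius `cap_radius`."""
    return any(angular_distance(*true_dir, *center) <= cap_radius
               for center in cap_centers)

# Hypothetical scenario: source at (40, 30) degrees, caps centered at (40, 10) and (40, 50).
deg = math.radians
source = (deg(40), deg(30))
caps = [(deg(40), deg(10)), (deg(40), deg(50))]
missed = not covered_by_caps(source, caps, deg(15))   # both caps are 20 degrees away
kept = covered_by_caps(source, caps, deg(25))
```

In this example the source is 20 degrees from both cap centers, so 15-degree caps miss it and 25-degree caps retain it. Sweeping the source position and counting misses would give the kind of capture-rate figure the controlled test calls for.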

Figures

Figures reproduced from arXiv: 2604.25387 by Hao Zhao, He Kong, Huanzhang Hu, Jiang Wang, Leying Yang, Ming Huang, Shuting Xu, Yujie Zhang, Yu Liu.

Figure 1: Overall architecture of the proposed ASAP framework. Stage 1 …
Figure 3: Comparison of RMSE under different source distances (1 m, 2 m, …
Figure 4: Experimental platform setup. An 8-microphone uniform circular array …
Original abstract

Direction-of-arrival (DOA) estimation is an important task in microphone array processing and many downstream applications. The steered response power with phase transform (SRP-PHAT) method has been widely adopted for DOA estimation in recent years. However, accurate SRP-PHAT estimation in 3D scenarios requires evaluating steering responses over thousands of candidate directions, severely limiting real-time performance on resource-constrained platforms. This challenge becomes even more critical for planar arrays, which are widely used in robotics due to their structural simplicity. Motivated by the fact that azimuth estimation is usually more reliable than elevation estimation for most arrays, we propose ASAP, an azimuth-priority strip-based search approach to planar microphone array DOA estimation in 3D. In the first stage, ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps. In the second stage, it refines elevation along the great-circle arc between two close candidates. Extensive simulations and real-world experiments validate the efficiency and merits of the proposed method over existing approaches.
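
To give a sense of scale for "thousands of candidate directions", the sketch below counts the directions and pairwise correlation lookups in an exhaustive grid search. The uniform azimuth/elevation grid, the restriction to the upper hemisphere, and the one-lookup-per-pair cost model are assumptions made for illustration; the 8-microphone count follows the array mentioned in the Figure 4 caption.

```python
def full_grid_cost(step_deg, n_mics):
    """Count directions and pairwise lookups for an exhaustive az/el grid.

    Assumes a uniform grid over azimuth [0, 360) and elevation [0, 90]
    (upper hemisphere only), with `step_deg` an integer divisor of 90, and
    one GCC-PHAT lookup per microphone pair per direction. These are
    illustrative assumptions, not the paper's exact setup.
    """
    n_az = 360 // step_deg
    n_el = 90 // step_deg + 1          # include both 0 and 90 degrees
    n_dirs = n_az * n_el
    n_pairs = n_mics * (n_mics - 1) // 2
    return n_dirs, n_dirs * n_pairs

coarse = full_grid_cost(5, 8)   # (1368, 38304)
fine = full_grid_cost(1, 8)     # (32760, 917280)
```

Under these assumptions, refining the grid from 5 degrees to 1 degree raises the per-frame cost from about 38 thousand to about 917 thousand lookups, which is the kind of growth a reduced search tries to avoid.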

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ASAP, an azimuth-priority strip-based search algorithm for efficient 3D DOA estimation with planar microphone arrays via SRP-PHAT. Motivated by the observation that azimuth estimates are typically more reliable than elevation estimates, the method uses a two-stage procedure: coarse-to-fine region contraction inside azimuthal strips to lock azimuth while retaining multiple candidate maxima through spherical caps, followed by elevation refinement along the great-circle arc connecting the two closest candidates. The authors state that simulations and real-world experiments demonstrate improved efficiency and performance relative to existing approaches.

Significance. If the accuracy claims hold, the work offers a practical, low-complexity alternative to exhaustive 3D grid searches for SRP-PHAT on planar arrays, which are common in robotics and other resource-constrained settings. The strip-based contraction exploits a domain-specific reliability difference to reduce the number of steering-vector evaluations without requiring new hardware or array geometries.

major comments (2)
  1. Abstract: the claim that 'extensive simulations and real-world experiments validate the efficiency and merits' is unsupported by any quantitative metrics, error bars, RMSE tables, or runtime comparisons in the provided text. This omission leaves the central performance claim (efficiency without accuracy loss) without visible evidence.
  2. First-stage description (azimuthal-strip contraction): the procedure retains multiple maxima via spherical caps but supplies no analytic bound on the maximum elevation mismatch that can be tolerated before the true azimuth peak falls outside the contracted strip. For planar arrays the SRP-PHAT surface can exhibit elevation-dependent grating lobes or broadened main lobes; if the initial coarse strip misses the global maximum, the subsequent arc refinement cannot recover it.
minor comments (1)
  1. The motivation paragraph would benefit from a brief citation to prior literature quantifying the relative reliability of azimuth versus elevation for planar arrays.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have addressed each point below and revised the manuscript to improve clarity and strengthen the supporting evidence.

Point-by-point responses
  1. Referee: Abstract: the claim that 'extensive simulations and real-world experiments validate the efficiency and merits' is unsupported by any quantitative metrics, error bars, RMSE tables, or runtime comparisons in the provided text. This omission leaves the central performance claim (efficiency without accuracy loss) without visible evidence.

    Authors: We agree that the abstract would benefit from more specific quantitative support. Although the full manuscript presents detailed simulation and experimental results with RMSE comparisons and runtime measurements in Sections IV and V, these were not highlighted in the abstract. We have revised the abstract to reference the key quantitative outcomes from our evaluations, and we have added a concise performance summary table to the revised manuscript to make the efficiency and accuracy claims directly visible. revision: yes

  2. Referee: First-stage description (azimuthal-strip contraction): the procedure retains multiple maxima via spherical caps but supplies no analytic bound on the maximum elevation mismatch that can be tolerated before the true azimuth peak falls outside the contracted strip. For planar arrays the SRP-PHAT surface can exhibit elevation-dependent grating lobes or broadened main lobes; if the initial coarse strip misses the global maximum, the subsequent arc refinement cannot recover it.

    Authors: We thank the referee for highlighting this important robustness consideration. The original manuscript provides empirical validation through simulations but does not include an analytic bound. In the revised version, we have added analysis in Section III that derives an approximate tolerance bound on elevation mismatch based on the expected main-lobe width of the SRP-PHAT spectrum for planar arrays. We have also included additional simulation results quantifying the capture rate of the true azimuth under varying elevation errors, confirming high reliability within practical operating ranges. We acknowledge that extreme grating-lobe scenarios could still cause the coarse stage to miss the global peak, and we have added a brief discussion of this limitation. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural algorithm with independent motivation

Full rationale

The paper describes a two-stage coarse-to-fine search heuristic for SRP-PHAT DOA estimation on planar arrays. The first stage contracts azimuthal strips while retaining maxima via spherical caps; the second refines elevation along great-circle arcs. This structure is presented as a direct algorithmic procedure motivated by the stated domain observation that azimuth is typically more reliable than elevation. No equations appear that define a quantity in terms of itself, no parameters are fitted to a data subset and then relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. Validation rests on external simulations and real-world experiments rather than internal reduction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach builds on standard SRP-PHAT without introducing new physical entities or fitted constants visible here.

pith-pipeline@v0.9.0 · 5518 in / 1049 out tokens · 43893 ms · 2026-05-07T14:24:29.287134+00:00 · methodology
