arxiv: 2604.25387 · v1 · submitted 2026-04-28 · 📡 eess.AS · cs.RO

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

Ming Huang, Shuting Xu, Leying Yang, Huanzhang Hu, Yujie Zhang, Jiang Wang, Yu Liu, Hao Zhao, He Kong

Pith reviewed 2026-05-07 14:24 UTC · model grok-4.3

keywords: DOA estimation, SRP-PHAT, planar microphone array, azimuth search, 3D direction finding, region contraction, real-time localization

The pith

ASAP locks azimuth first within azimuthal strips, then refines elevation along great-circle arcs, reducing the number of SRP-PHAT evaluations needed for planar arrays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ASAP to address the computational bottleneck of full-grid SRP-PHAT searches, which evaluate thousands of candidate directions to obtain accurate 3D DOA estimates with planar microphone arrays. It exploits the observation that azimuth estimates are typically more reliable than elevation estimates: the first stage performs coarse-to-fine contraction inside azimuthal strips, locking candidate azimuths while preserving multiple peaks inside spherical caps. The second stage then searches only along the great-circle arc between two close retained candidates to refine elevation. If the approach works as described, it cuts the number of steering-response calculations enough to support real-time 3D tracking on robots and other resource-limited hardware while keeping accuracy close to that of exhaustive search.
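The azimuth-first idea can be illustrated with a deliberately simplified stand-in. The sketch below is not the authors' algorithm: it uses a noise-free, single-source far-field model, keeps only one azimuth candidate instead of several spherical caps, and refines elevation along a meridian rather than along a great-circle arc between two candidates. The array geometry, frequency band, step sizes, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def unit_vec(az_deg, el_deg):
    """Unit direction vector for azimuth/elevation given in degrees."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def make_srp(mics, src_az, src_el, freqs):
    """Noise-free PHAT-style steered response for one far-field source.

    Returns the response function and a dict that counts its evaluations.
    """
    i, j = np.triu_indices(len(mics), k=1)
    baselines = mics[i] - mics[j]                      # shape (pairs, 3)
    tau_src = baselines @ unit_vec(src_az, src_el) / C
    counter = {"n": 0}

    def srp(az_deg, el_deg):
        counter["n"] += 1
        tau = baselines @ unit_vec(az_deg, el_deg) / C
        phase = 2 * np.pi * freqs[None, :] * (tau - tau_src)[:, None]
        return float(np.cos(phase).sum())

    return srp, counter

def azimuth_first_search(srp):
    """Simplified azimuth-priority search (hypothetical step sizes)."""
    # Stage 1a: coarse azimuth scan on three elevation rings.
    coarse = [(srp(az, el), az, el) for el in (15, 45, 75) for az in range(0, 360, 10)]
    _, az_c, el_c = max(coarse)
    # Stage 1b: contract to a +/-10 degree azimuth window at 1 degree steps.
    fine = [(srp(az % 360, el_c), az % 360) for az in range(az_c - 10, az_c + 11)]
    _, az_hat = max(fine)
    # Stage 2: with azimuth locked, scan elevation only.
    _, el_hat = max((srp(az_hat, el), el) for el in range(0, 91))
    return az_hat, el_hat

# Hypothetical 8-microphone uniform circular array, radius 4 cm, in the z = 0 plane.
angles = 2 * np.pi * np.arange(8) / 8
mics = 0.04 * np.stack([np.cos(angles), np.sin(angles), np.zeros(8)], axis=1)
freqs = np.arange(300.0, 2001.0, 100.0)

srp, counter = make_srp(mics, src_az=123, src_el=40, freqs=freqs)
az_hat, el_hat = azimuth_first_search(srp)
two_stage_evals = counter["n"]   # 108 coarse + 21 fine + 91 elevation
full_grid_evals = 360 * 91       # 1 degree grid over the upper hemisphere
print(az_hat, el_hat, two_stage_evals, full_grid_evals)
```

In this toy setting the staged search needs 220 response evaluations against 32,760 for the 1° exhaustive grid. The ratio says nothing about the paper's reported gains; it only shows where the savings come from when azimuth can be fixed before elevation is searched.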

Core claim

ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps in the first stage; it then refines elevation along the great-circle arc between two close candidates in the second stage.
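The second-stage geometry, sampling directions along a great-circle arc between two candidates, can be sketched with spherical linear interpolation. This is a generic construction, not code from the paper; the function names and the example candidates are hypothetical, and the interpolation is undefined for exactly antipodal inputs, which should not arise for two close candidates.

```python
import numpy as np

def unit_vec(az_deg, el_deg):
    """Unit direction vector for azimuth/elevation given in degrees."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def great_circle_arc(u, v, num):
    """Sample `num` unit vectors along the shorter great-circle arc from u to v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    omega = np.arccos(np.clip(u @ v, -1.0, 1.0))   # angle between the endpoints
    if omega < 1e-12:                               # coincident endpoints
        return np.repeat(u[None, :], num, axis=0)
    t = np.linspace(0.0, 1.0, num)[:, None]
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

def to_az_el(points):
    """Convert unit vectors (rows) back to azimuth/elevation in degrees."""
    az = np.rad2deg(np.arctan2(points[:, 1], points[:, 0])) % 360
    el = np.rad2deg(np.arcsin(np.clip(points[:, 2], -1.0, 1.0)))
    return az, el

# Two hypothetical candidates that share an azimuth, so the arc is a meridian.
arc = great_circle_arc(unit_vec(60, 20), unit_vec(60, 50), num=4)
arc_az, arc_el = to_az_el(arc)
print(np.round(arc_az, 6), np.round(arc_el, 6))
```

Each sampled direction would then be scored with the steered response, so the cost of the refinement grows with the number of arc samples rather than with the size of a 2D grid.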

What carries the argument

Azimuth-priority strip contraction that retains maxima inside spherical caps followed by great-circle-arc elevation refinement.

Load-bearing premise

Azimuth estimation remains reliably more accurate than elevation estimation for most planar microphone arrays.
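The premise can be motivated with the standard far-field delay model, which is textbook material rather than a derivation taken from the paper. The sketch assumes microphones in the plane z = 0, elevation θ measured from that plane, sound speed c, and a baseline between microphones i and j of length d_ij and in-plane orientation α_ij; all of these symbols are introduced here for illustration.

```latex
\begin{aligned}
\mathbf{u}(\phi,\theta) &= (\cos\theta\cos\phi,\ \cos\theta\sin\phi,\ \sin\theta)^{\top},\\
\tau_{ij}(\phi,\theta) &= \frac{(\mathbf{p}_i-\mathbf{p}_j)^{\top}\mathbf{u}(\phi,\theta)}{c}
  = \frac{d_{ij}}{c}\,\cos\theta\,\cos(\phi-\alpha_{ij}),\\
\frac{\partial\tau_{ij}}{\partial\phi} &= -\frac{d_{ij}}{c}\,\cos\theta\,\sin(\phi-\alpha_{ij}),
\qquad
\frac{\partial\tau_{ij}}{\partial\theta} = -\frac{d_{ij}}{c}\,\sin\theta\,\cos(\phi-\alpha_{ij}).
\end{aligned}
```

Under this model the delays depend on elevation only through cos θ, so τ_ij(φ, θ) = τ_ij(φ, −θ) (an up/down ambiguity), and the elevation sensitivity scales with sin θ, vanishing for sources near the array plane. The azimuth sensitivity scales with cos θ, so it degrades toward the zenith instead. This is consistent with azimuth being the better-conditioned angle at low to moderate elevations, and it suggests the premise is weakest for sources close to the array normal.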

What would settle it

A controlled test in which the true source direction lies outside all retained spherical caps after the first-stage azimuth contraction, causing the second stage to miss the correct elevation.
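The failure condition described above reduces to an angular-distance check between the true direction and each cap centre. The sketch below is one possible way to flag such trials in a simulation; the cap centres, cap radii, and function names are hypothetical, since the provided text does not state how the paper parameterises its caps.

```python
import numpy as np

def unit_vec(az_deg, el_deg):
    """Unit direction vector for azimuth/elevation given in degrees."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def outside_all_caps(true_dir, cap_centers, cap_radius_deg):
    """True if the true direction is farther than the cap radius from every centre."""
    cosines = cap_centers @ true_dir
    angles = np.rad2deg(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return bool(np.all(angles > cap_radius_deg))

# Hypothetical trial: true source at (123, 40) degrees, two retained caps.
true_dir = unit_vec(123, 40)
caps = np.stack([unit_vec(120, 45), unit_vec(300, 45)])

missed_wide = outside_all_caps(true_dir, caps, cap_radius_deg=10.0)
missed_narrow = outside_all_caps(true_dir, caps, cap_radius_deg=5.0)
print(missed_wide, missed_narrow)
```

Here the nearest cap centre is about 5.5° from the source, so a 10° cap retains it while a 5° cap does not. Counting such misses over many simulated trials would give the first-stage capture rate that the test calls for.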

Figures

Figures reproduced from arXiv: 2604.25387 by Hao Zhao, He Kong, Huanzhang Hu, Jiang Wang, Leying Yang, Ming Huang, Shuting Xu, Yujie Zhang, Yu Liu.

**Figure 1.** Overall architecture of the proposed ASAP framework. Stage 1 […]

**Figure 3.** Comparison of RMSE under different source distances (1 m, 2 m, […]

**Figure 4.** Experimental platform setup. An 8-microphone uniform circular array […]

Original abstract

Direction-of-arrival (DOA) estimation is an important task in microphone array processing and many downstream applications. The steered response power with phase transform (SRP-PHAT) method has been widely adopted for DOA estimation in recent years. However, accurate SRP-PHAT estimation in 3D scenarios requires evaluating steering responses over thousands of candidate directions, severely limiting real-time performance on resource-constrained platforms. This challenge becomes even more critical for planar arrays, which are widely used in robotics due to their structural simplicity. Motivated by the fact that azimuth estimation is usually more reliable than elevation estimation for most arrays, we propose ASAP, an azimuth-priority strip-based search approach to planar microphone array DOA estimation in 3D. In the first stage, ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps. In the second stage, it refines elevation along the great-circle arc between two close candidates. Extensive simulations and real-world experiments validate the efficiency and merits of the proposed method over existing approaches.
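For reference, the cost the abstract describes follows from the usual form of the SRP-PHAT functional. The notation below is the common textbook form, not copied from the paper: X_i(ω) denotes the spectrum of microphone i, τ_ij(u) the far-field delay between microphones i and j for candidate direction u, M the number of microphones, and G the search grid. The sign convention of the exponent is an assumption.

```latex
P(\mathbf{u}) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} R_{ij}\big(\tau_{ij}(\mathbf{u})\big),
\qquad
R_{ij}(\tau) = \int \frac{X_i(\omega)\,X_j^{*}(\omega)}{\lvert X_i(\omega)\,X_j^{*}(\omega)\rvert}\,e^{\,j\omega\tau}\,d\omega,
\qquad
\hat{\mathbf{u}} = \arg\max_{\mathbf{u}\in\mathcal{G}} P(\mathbf{u}).
```

An exhaustive search therefore costs |G| · M(M−1)/2 GCC-PHAT lookups per frame. As an illustrative count only, an 8-microphone array (the size mentioned in the Figure 4 caption) has 28 pairs, and a hypothetical 1° grid over the upper hemisphere has 360 × 91 = 32,760 directions, which gives 917,280 lookups per frame.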

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ASAP gives a concrete two-stage strip search that trims SRP-PHAT evaluations for planar-array 3D DOA by locking azimuth first, but the gain depends on an unproven assumption that elevation mismatch won't push the true peak out of the retained caps.

The paper's main contribution is a specific search procedure: coarse-to-fine contraction inside azimuthal strips while holding multiple maxima with spherical caps, followed by elevation refinement along the great-circle arc between close candidates. This is not a generic grid reduction; the geometry and the azimuth-first ordering are the new pieces. It directly targets the computational cost of full 3D SRP-PHAT on planar arrays, which matters in robotics, where those arrays are common and localization has to run in real time. The motivation is stated plainly and the steps are described without unnecessary equations, which makes the method easy to implement from the text. That is useful work.

The experiments are mentioned, but the abstract supplies no numbers, error bars, or baseline tables, so the efficiency and accuracy claims remain assertions until the full results are checked. The bigger concern is whether the first stage can reliably keep the true direction inside the caps. Planar arrays can produce elevation-dependent sidelobes or broadened lobes in the SRP-PHAT surface; if a coarse azimuth strip misses the true peak because of that, the later arc step has nothing to recover. The paper notes the usual reliability of azimuth but gives no analytic tolerance on elevation error or any failure-case analysis. If the simulations and real recordings cover those cases and still show gains, the method holds; without that evidence, the central shortcut rests on an assumption that may not be safe across all geometries.

This is for people who build real-time acoustic localization on embedded hardware and already use SRP-PHAT. A reader who needs a drop-in faster search for planar arrays will find the algorithm description worth reading. It is worth sending to peer review so the experiments can be examined and the robustness question can be settled.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ASAP, an azimuth-priority strip-based search algorithm for efficient 3D DOA estimation with planar microphone arrays via SRP-PHAT. Motivated by the observation that azimuth estimates are typically more reliable than elevation estimates, the method uses a two-stage procedure: coarse-to-fine region contraction inside azimuthal strips to lock azimuth while retaining multiple candidate maxima through spherical caps, followed by elevation refinement along the great-circle arc between two close candidates. The authors state that simulations and real-world experiments demonstrate improved efficiency and performance relative to existing approaches.

Significance. If the accuracy claims hold, the work offers a practical, low-complexity alternative to exhaustive 3D grid searches for SRP-PHAT on planar arrays, which are common in robotics and other resource-constrained settings. The strip-based contraction exploits a domain-specific reliability difference to reduce the number of steering-vector evaluations without requiring new hardware or array geometries.

major comments (2)

Abstract: the claim that 'extensive simulations and real-world experiments validate the efficiency and merits' is unsupported by any quantitative metrics, error bars, RMSE tables, or runtime comparisons in the provided text. This omission leaves the central performance claim (efficiency without accuracy loss) without visible evidence.
First-stage description (azimuthal-strip contraction): the procedure retains multiple maxima via spherical caps but supplies no analytic bound on the maximum elevation mismatch that can be tolerated before the true azimuth peak falls outside the contracted strip. For planar arrays the SRP-PHAT surface can exhibit elevation-dependent grating lobes or broadened main lobes; if the initial coarse strip misses the global maximum, the subsequent arc refinement cannot recover it.

minor comments (1)

The motivation paragraph would benefit from a brief citation to prior literature quantifying the relative reliability of azimuth versus elevation for planar arrays.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have addressed each point below and revised the manuscript to improve clarity and strengthen the supporting evidence.

Referee: Abstract: the claim that 'extensive simulations and real-world experiments validate the efficiency and merits' is unsupported by any quantitative metrics, error bars, RMSE tables, or runtime comparisons in the provided text. This omission leaves the central performance claim (efficiency without accuracy loss) without visible evidence.

Authors: We agree that the abstract would benefit from more specific quantitative support. Although the full manuscript presents detailed simulation and experimental results with RMSE comparisons and runtime measurements in Sections IV and V, these were not highlighted in the abstract. We have revised the abstract to reference the key quantitative outcomes from our evaluations, and we have added a concise performance summary table to the revised manuscript to make the efficiency and accuracy claims directly visible.

Revision: yes

Referee: First-stage description (azimuthal-strip contraction): the procedure retains multiple maxima via spherical caps but supplies no analytic bound on the maximum elevation mismatch that can be tolerated before the true azimuth peak falls outside the contracted strip. For planar arrays the SRP-PHAT surface can exhibit elevation-dependent grating lobes or broadened main lobes; if the initial coarse strip misses the global maximum, the subsequent arc refinement cannot recover it.

Authors: We thank the referee for highlighting this important robustness consideration. The original manuscript provides empirical validation through simulations but does not include an analytic bound. In the revised version, we have added analysis in Section III that derives an approximate tolerance bound on elevation mismatch based on the expected main-lobe width of the SRP-PHAT spectrum for planar arrays. We have also included additional simulation results quantifying the capture rate of the true azimuth under varying elevation errors, confirming high reliability within practical operating ranges. We acknowledge that extreme grating-lobe scenarios could still cause the coarse stage to miss the global peak, and we have added a brief discussion of this limitation.

Revision: yes

Circularity Check

0 steps flagged

No circularity: procedural algorithm with independent motivation

The paper describes a two-stage coarse-to-fine search heuristic for SRP-PHAT DOA estimation on planar arrays. The first stage contracts azimuthal strips while retaining maxima via spherical caps; the second refines elevation along great-circle arcs. This structure is presented as a direct algorithmic procedure motivated by the stated domain observation that azimuth is typically more reliable than elevation. No equations appear that define a quantity in terms of itself, no parameters are fitted to a data subset and then relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. Validation rests on external simulations and real-world experiments rather than internal reduction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach builds on standard SRP-PHAT without introducing new physical entities or fitted constants visible here.

pith-pipeline@v0.9.0 · 5518 in / 1049 out tokens · 43893 ms · 2026-05-07T14:24:29.287134+00:00 · methodology
