NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction
Pith reviewed 2026-05-08 03:47 UTC · model grok-4.3
The pith
NDF+ jointly reconstructs virtual directional microphones and extracts diffuse sound to enable control over reverberation effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NDF+ reformulates VDM estimation into two coupled subtasks of dereverberated VDM reconstruction and diffuse sound extraction. This enables manipulation of diffuse components in the reconstructed VDM output. Under reverberant conditions NDF+ outperforms representative conventional baselines on both subtasks while maintaining VDM reconstruction quality comparable to the original single-task NDF model. In stereo recording applications NDF+ provides controllable inter-channel level differences by adjusting the estimated diffuse component.
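The stereo claim above can be illustrated with a toy recombination sketch, assuming the network outputs a coherent (directional) estimate per channel plus a shared diffuse estimate that are summed with a user-chosen diffuse gain; all names, shapes, and signals below are hypothetical stand-ins, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 257, 100  # STFT bins and frames (illustrative sizes)

# Hypothetical network outputs for left/right virtual directional
# microphones: a coherent (directional) part per channel and a shared diffuse part.
z_coh = {"L": rng.standard_normal((F, T)), "R": 0.5 * rng.standard_normal((F, T))}
z_diff = rng.standard_normal((F, T))

def render_channel(ch, alpha):
    """Recombine coherent and scaled diffuse estimates
    (a simplified additive mix with user gain alpha)."""
    return z_coh[ch] + alpha * z_diff

def icld_db(left, right):
    """Inter-channel level difference in dB (energy ratio)."""
    return 10.0 * np.log10(np.sum(np.abs(left) ** 2) / np.sum(np.abs(right) ** 2))

# Lowering the diffuse gain widens the level difference between channels,
# since shared diffuse energy no longer masks the directional imbalance.
for alpha in (1.0, 0.5, 0.0):
    l, r = render_channel("L", alpha), render_channel("R", alpha)
    print(f"alpha={alpha}: ICLD = {icld_db(l, r):+.1f} dB")
```

This is the sense in which adjusting the estimated diffuse component gives a controllable inter-channel level difference.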
What carries the argument
Coupled subtasks of dereverberated virtual directional microphone reconstruction and diffuse sound extraction that together permit independent manipulation of diffuse components in the output.
Load-bearing premise
The neural network can separate directional and diffuse parts of the input signals in reverberant conditions without creating artifacts or forcing a performance trade-off between the two subtasks.
What would settle it
Objective metrics or listening tests on new reverberant recordings in which increasing or decreasing the extracted diffuse component produces no measurable change in inter-channel level differences, or in which overall VDM quality drops below that of the original NDF model.
original abstract
Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural directional filtering and diffuse sound extraction. NDF+ reformulates VDM estimation into two coupled subtasks: dereverberated VDM reconstruction and diffuse sound extraction. This reformulation enables NDF+ to manipulate diffuse components in the final reconstructed VDM output. We evaluated NDF+ under reverberant conditions and compared it with representative conventional baselines. Results show that NDF+ consistently outperforms the baselines on both subtasks, while maintaining VDM reconstruction quality comparable to that of the original single-task NDF model. These findings indicate that NDF+ introduces an additional degree of freedom for diffuse sound control in the VDM reconstruction. In a stereo recording application, NDF+ provides controllable inter-channel level differences between left and right channels by adjusting the estimated diffuse component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NDF+, which extends neural directional filtering to jointly perform dereverberated virtual directional microphone (VDM) reconstruction and diffuse sound extraction. This coupled formulation allows manipulation of diffuse components in the VDM output. Under reverberant conditions, NDF+ is reported to outperform representative conventional baselines on both subtasks while maintaining VDM reconstruction quality comparable to the single-task NDF model. An application to stereo recording is shown where adjusting the estimated diffuse component provides controllable inter-channel level differences.
Significance. If the results hold, NDF+ offers an additional degree of freedom for diffuse sound control in VDM reconstruction, which is valuable for spatial sound capture applications. The approach builds on the original NDF by reformulating the task into coupled subtasks without apparent loss in VDM quality. This could enable more flexible audio processing pipelines.
major comments (2)
- [Abstract] The central claim that 'NDF+ consistently outperforms the baselines on both subtasks, while maintaining VDM reconstruction quality comparable to that of the original single-task NDF model' is not supported by any quantitative metrics, error bars, specific dataset details, or statistical tests in the abstract. This undermines the ability to assess the soundness of the outperformance and comparability assertions.
- [Evaluation under reverberant conditions] No validation is provided that the joint optimization achieves clean separation of directional and diffuse fields without artifacts or trade-offs in reverberant conditions (where the diffuse field is not perfectly isotropic). There is no mention of oracle diffuse references, listening tests for artifacts, or loss-weight ablations to confirm the VDM directivity pattern remains uncompromised.
minor comments (1)
- [Abstract] The acronym VDM is introduced without an initial parenthetical expansion in the abstract, although it is clarified later as virtual directional microphone.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below and outline revisions to improve clarity and completeness where feasible.
point-by-point responses
-
Referee: [Abstract] The central claim that 'NDF+ consistently outperforms the baselines on both subtasks, while maintaining VDM reconstruction quality comparable to that of the original single-task NDF model' is not supported by any quantitative metrics, error bars, specific dataset details, or statistical tests in the abstract. This undermines the ability to assess the soundness of the outperformance and comparability assertions.
Authors: We agree that the abstract, as a concise summary, would benefit from greater specificity to allow readers to better evaluate the claims. The main text provides the supporting quantitative results, dataset details, and comparisons, but we will revise the abstract to incorporate representative performance metrics (e.g., improvements in the relevant error measures for each subtask) and a brief reference to the evaluation conditions. This change will be made without altering the overall length or focus of the abstract. revision: yes
-
Referee: [Evaluation under reverberant conditions] No validation is provided that the joint optimization achieves clean separation of directional and diffuse fields without artifacts or trade-offs in reverberant conditions (where the diffuse field is not perfectly isotropic). There is no mention of oracle diffuse references, listening tests for artifacts, or loss-weight ablations to confirm the VDM directivity pattern remains uncompromised.
Authors: The manuscript demonstrates that VDM reconstruction quality remains comparable to the single-task NDF baseline under the same reverberant conditions, which provides evidence that the joint formulation does not introduce measurable trade-offs in directivity. Objective metrics on both subtasks further support effective separation. We did not employ oracle diffuse references because the method operates in a blind setting without access to ground-truth diffuse fields. No formal listening tests were conducted, as the evaluation prioritized quantitative comparisons with conventional baselines. We will add a short discussion of the isotropic diffuse-field assumption and its limitations in reverberation, along with a loss-weight sensitivity analysis to confirm stability of the VDM pattern. New subjective listening tests, however, would require additional resources and are noted as future work. revision: partial
- New subjective listening tests and access to oracle diffuse references cannot be provided without collecting additional data and conducting new experiments beyond the scope of the current manuscript.
Circularity Check
NDF+ extends the prior NDF via joint optimization; no derivation reduces the claimed gains to self-referential inputs.
full rationale
The paper proposes a neural architecture reformulating VDM estimation as coupled dereverberated VDM reconstruction and diffuse extraction, then reports empirical outperformance on reverberant test conditions against conventional baselines while preserving single-task VDM quality. No equations, uniqueness theorems, or fitted-parameter renamings are presented that would make the claimed gains equivalent to the training inputs by construction. The central results rest on data-driven evaluation rather than a closed mathematical loop. A citation to the original NDF work appears but is not load-bearing for the new joint-task claims, which are validated independently.
Axiom & Free-Parameter Ledger
free parameters (1)
- Neural network weights and hyperparameters
axioms (1)
- Domain assumption: audio signals in reverberant rooms can be decomposed into directional and diffuse components that a neural network can jointly estimate.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION A fixed beamformer (FBF) with an appropriate directivity pattern enables precise spatial rendering of sound sources and preserves key spatial cues, even in multi-source scenarios. However, conventional FBFs, such as differential microphone arrays (DMA) [1, 2] and superdirective beamforming [3], are fundamentally limited by a compact array ...
-
[2]
PROBLEM FORMULATION We consider a compact array of Q omnidirectional microphones recording an acoustic scene with N sound sources in a reverberant room. The array and all sources are assumed to lie in the x-y plane. Let X_{q,n}(f, t) denote the short-time Fourier transform (STFT) coefficient at the q-th microphone due to the n-th source, where f and t denote the frequ...
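The signal model in this excerpt can be made concrete with a minimal STFT sketch: a naive Hann-window implementation on synthetic per-source, per-microphone signals. The 16 kHz rate and all array sizes are illustrative assumptions, not values from the excerpt:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT: Hann-windowed frames of length n_fft with hop-size hop."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1).T  # shape (n_fft // 2 + 1, n_frames)

rng = np.random.default_rng(0)
Q, N, length = 4, 2, 16_000  # mics, sources, samples (illustrative)

# X[q][n] holds the STFT coefficients X_{q,n}(f, t) of the contribution of
# source n at microphone q, as in the problem formulation.
X = [[stft(rng.standard_normal(length)) for _ in range(N)] for _ in range(Q)]
print(X[0][0].shape)  # (257, 61): a 512-point window yields 257 frequency bins
```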
-
[3]
PROPOSED METHOD 3.1. DNN Architecture and Training Loss: The FT-JNF framework [14] used for the NDF task [5, 7] employs two distinct long short-term memory (LSTM) networks to estimate a single complex-valued mask and applies it to a reference channel to estimate a wanted signal. To accommodate estimates for two distinct targets (Z_coh(f, t) and Z_diff(f, t...
-
[4]
EXPERIMENTAL SETUP Configurations: A four-microphone array (Q = 4, diameter 3 cm) was used, consisting of three microphones arranged in a uniform circular array (UCA) and one positioned at the center as the reference microphone. The reference microphone signal served as the first input channel for the NDF+ model. The target direction of the directivity pa...
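The stated geometry, three microphones on a 3 cm-diameter UCA plus a center reference, can be reproduced in a few lines; the coordinate convention and mic ordering are assumptions for illustration:

```python
import numpy as np

# Array geometry from the experimental setup: three mics on a uniform
# circular array (UCA) of 3 cm diameter, plus a center reference microphone.
radius = 0.03 / 2  # metres
angles = 2 * np.pi * np.arange(3) / 3  # UCA: 120-degree spacing (assumed)
uca = radius * np.column_stack([np.cos(angles), np.sin(angles)])
positions = np.vstack([[0.0, 0.0], uca])  # reference mic listed first

for q, (x, y) in enumerate(positions):
    print(f"mic {q}: ({x:+.4f}, {y:+.4f}) m")
```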
-
[5]
The STFT used a 512-point window and a 256-point hop size. Performance measures: We used the signal-to-distortion ratio (SDR) [20] and perceptual evaluation of speech quality (PESQ) [21, 22] to measure the distance between estimated signals and target signals. The obtained directivity patterns were estimated using the method described in [7].
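As a reference point for the SDR metric mentioned here, a simplified energy-ratio definition can be sketched. Note this is a toy version: the BSS-eval SDR the paper cites additionally allows a scaled or filtered target:

```python
import numpy as np

def sdr_db(target, estimate):
    """Simplified signal-to-distortion ratio in dB:
    target energy over residual (target - estimate) energy."""
    residual = target - estimate
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
s = rng.standard_normal(16_000)            # hypothetical target signal
noisy = s + 0.1 * rng.standard_normal(16_000)  # estimate with mild distortion
print(f"SDR = {sdr_db(s, noisy):.1f} dB")  # around 20 dB for a 0.1-scaled residual
```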
-
[6]
EXPERIMENTAL RESULTS 5.1. Performance Analysis: The proposed NDF+ model jointly addresses two explicit subtasks: dereverberated VDM reconstruction (Ẑ_coh) and diffuse sound extraction (Ẑ_diff). By achieving both, it implicitly realizes the VDM reconstruction task (Ẑ_vdm) using (7). Table 2 presents the results for various RT60 values. For VDM reconstruc...
-
[7]
CONCLUSIONS We introduce NDF+, a joint framework for neural directional filtering and diffuse sound extraction. NDF+ splits the VDM estimation into dereverberated VDM reconstruction and diffuse sound extraction. It consistently outperforms baselines on both tasks and matches the single-task NDF model for VDM reconstruction. Joint optimization mainta...
-
[8]
ACKNOWLEDGMENTS The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).
-
[9]
Jacob Benesty and Jingdong Chen, Study and Design of Differential Microphone Arrays, vol. 6, Springer Science & Business Media, 2012.
-
[10]
Jacob Benesty, Jingdong Chen, and Israel Cohen, Design of Circular Differential Microphone Arrays, vol. 12, Springer, 2015.
-
[11]
Joerg Bitzer and K. Uwe Simmer, “Superdirective microphone arrays,” in Microphone Arrays: Signal Processing Techniques and Applications, pp. 19–38, Springer, 2001.
-
[12]
Jacob Benesty, Israel Cohen, and Jingdong Chen, “Fixed beamforming,” Fundamentals of Signal Enhancement and Array Signal Processing, pp. 237–282, 2018.
-
[13]
Julian Wechsler, Srikanth Raj Chetupalli, Mhd Modar Halimeh, Oliver Thiergart, and Emanuël A. P. Habets, “Neural directional filtering: Far-field directivity control with a small microphone array,” in Proc. Intl. Workshop Acoust. Signal Enhancement (IWAENC), IEEE, 2024, pp. 459–463.
-
[14]
Weilong Huang, Mhd Modar Halimeh, Srikanth Raj Chetupalli, Oliver Thiergart, and Emanuël A. P. Habets, “Steerable neural directional filtering,” in Proc. Forum Acusticum Euronoise, European Acoustics Association, 2025.
-
[15]
Weilong Huang, Srikanth Raj Chetupalli, Mhd Modar Halimeh, Oliver Thiergart, and Emanuël A. P. Habets, “Neural directional filtering using a compact microphone array,” arXiv preprint arXiv:2511.07185, 2025.
-
[16]
Weilong Huang, Srikanth Raj Chetupalli, and Emanuël A. P. Habets, “Neural directional filtering with configurable directivity pattern at inference,” arXiv preprint arXiv:2510.20253, 2025.
-
[17]
Jens Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, 1997.
-
[18]
Christof Faller and Juha Merimaa, “Source localization in complex listening situations: Selection of binaural cues based on interaural coherence,” The Journal of the Acoustical Society of America, vol. 116, no. 5, pp. 3075–3089, 2004.
-
[19]
Yekutiel Avargel and Israel Cohen, “On multiplicative transfer function approximation in the short-time Fourier transform domain,” IEEE Signal Process. Lett., vol. 14, no. 5, pp. 337–340, 2007.
-
[20]
Gary W. Elko, “Superdirectional microphone arrays,” Acoustic Signal Processing for Telecommunication, pp. 181–237, 2000.
-
[21]
John Eargle, The Microphone Book: From Mono to Stereo to Surround – A Guide to Microphone Design and Application, Routledge, 2012.
-
[22]
Kristina Tesch and Timo Gerkmann, “Insights into deep non-linear filters for improved multi-channel speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., 2023.
-
[23]
Emanuël A. P. Habets, “RIR generator,” https://github.com/ehabets/RIR-Generator, 2020, commit 3cf914d.
-
[24]
Emanuël A. P. Habets, “Monte Carlo RIR simulation,” https://github.com/audiolabs/MonteCarloRIRSimulation, 2026, commit d464a10.
-
[25]
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
-
[26]
Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, and Timo Gerkmann, “EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation,” in Proc. Interspeech Conf., 2024, pp. 4873–4877.
-
[27]
ITU-R, “Recommendation ITU-R BS.1770-5: Algorithms to measure audio programme loudness and true-peak audio level,” 2023.
-
[28]
Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
-
[29]
Matteo Torcoli, Mhd Modar Halimeh, and Emanuël A. P. Habets, “PESQ for P.862.2,” https://github.com/audiolabs/PESQ, 2025, commit d11671a.
-
[30]
Matteo Torcoli, Mhd Modar Halimeh, and Emanuël A. P. Habets, “Navigating PESQ: Up-to-date versions and open implementations,” in Speech Communication; 16th ITG Conference, VDE, 2025, pp. 51–55.
-
[31]
Takuya Yoshioka, Hideyuki Tachibana, Tomohiro Nakatani, and Masato Miyoshi, “Adaptive dereverberation of speech signals with speaker-position change detection,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2009, pp. 3733–3736.
-
[32]
Weilong Huang, Cheng Xue, Jinwei Feng, and W. Bastiaan Kleijn, “A practical online multichannel dereverberation approach with data-reuse technique,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 501–505.
-
[33]
Oliver Thiergart and Emanuël A. P. Habets, “Extracting reverberant sound using a linearly constrained minimum variance spatial filter,” IEEE Signal Process. Lett., 2014.
-
[34]
Emanuël A. P. Habets, “DAS generator,” https://github.com/ehabets/das-generator, 2025, commit 6f2cd6d.
-
[35]
Michael Williams, “The stereophonic zoom,” Rycote Microphone Windshields Ltd and Human Computer Interface, Gloucestershire (UK), 2002.