arxiv: 2605.10398 · v1 · submitted 2026-05-11 · 📡 eess.AS

Recognition: 2 theorem links

· Lean Theorem

SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements

Ege Erdem, Orchisama Das, Shoichi Koyama, Tomohiko Nakamura, Zoran Cvetkovi\'c

Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3

classification 📡 eess.AS

keywords sound field reconstructionacoustic transfer functionflow matchingsparse measurements3D U-Netpermutation-invariant encoderspatial audiogenerative modeling

0 comments

The pith

Flow matching reconstructs 3D sound field magnitudes from sparse microphone measurements up to 1 kHz.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames reconstruction of acoustic transfer function magnitudes in three dimensions as a guided generative task solved by flow matching. A 3D U-Net receives conditioning from a permutation-invariant encoder that ingests any number of microphone readings and outputs the full magnitude field. This matters for room characterization and audio correction because it replaces the need for dense sensor arrays with a smaller set of measurements. The method trains faster than an autoencoder baseline and its accuracy rises as the training dataset grows larger. Experiments confirm reliable performance through 1 kHz under the tested conditions.

Core claim

We propose SF-Flow, a framework that treats 3D ATF magnitude reconstruction as a guided generation task using flow matching. The model employs a 3D U-Net conditioned by a permutation-invariant set encoder that handles an arbitrary number of sparse microphone measurements. This enables stable and efficient training compared to autoencoder baselines, achieving accurate reconstructions up to 1 kHz that improve with increasing dataset sizes.

What carries the argument

Flow matching as a guided generation process on a 3D U-Net conditioned by a permutation-invariant set encoder that accepts sparse microphone inputs of variable count.

Load-bearing premise

The flow matching process guided by the permutation-invariant set encoder on a 3D U-Net can reliably recover the underlying acoustic properties from sparse measurements without introducing artifacts or failing at higher frequencies or complex geometries.

What would settle it

Direct comparison of reconstructed magnitudes against ground-truth measurements in a room with non-convex geometry showing large errors above 1 kHz or no advantage over the autoencoder baseline.

read the original abstract

Reconstructing a 3D sound field from sparse microphone measurements is a fundamental yet ill-posed problem, which we address through Acoustic Transfer Function (ATF) magnitude estimation. ATF magnitude encapsulates key perceptual and acoustic properties of a physical space with applications in room characterization and correction. Although recent generative paradigms such as Flow Matching (FM) have achieved state-of-the-art performance in speech and music generation, their potential in spatial audio remains underexplored. We propose a novel framework for 3D ATF magnitude reconstruction as a guided generation task, with a 3D U-Net conditioned by a permutation-invariant set encoder. This architecture enables reconstruction from an arbitrary number of sparse inputs while leveraging the stable and efficient training properties of FM. Experimental results demonstrate that SF-Flow achieves accurate reconstruction up to \SI{1}{kHz}, trains substantially faster than the autoencoder baseline, and improves significantly with dataset size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SF-Flow brings flow matching to 3D ATF magnitude reconstruction with a set encoder that handles arbitrary sparse inputs, and the reported gains over autoencoders look real but stay narrow in frequency and scope.

read the letter

The main point is that this paper takes flow matching, which has worked well for audio generation, and applies it to reconstructing 3D acoustic transfer function magnitudes from any number of sparse microphone measurements. The 3D U-Net conditioned by a permutation-invariant set encoder is the practical piece that lets the model accept variable input counts without retraining or padding tricks. That directly tackles a common headache in real measurement setups. The experiments claim accurate results up to 1 kHz, noticeably faster training than the autoencoder baseline, and clear improvement as the dataset gets larger, which suggests the generative framing helps with the ill-posed nature of the problem. Those are the concrete advances here. The architecture choice and the scaling behavior are the parts that feel like genuine progress rather than routine tweaks. The frequency cap and magnitude-only focus keep the contribution inside a specific niche, but within that niche the method addresses the variable-cardinality issue cleanly. I don't see load-bearing assumptions that collapse on inspection, and the claims line up with the described approach without obvious circularity. The work is aimed at researchers in room acoustics and spatial audio who need reconstruction tools that work with irregular sensor placements. A reader already familiar with flow matching or set encoders will see the value quickly, while someone outside acoustics might find the application interesting but not transformative. The paper is coherent enough on its own terms to deserve a serious referee. The core idea and implementation are worth the time even if revisions are needed on the evaluation details.

Referee Report

2 major / 2 minor

Summary. The paper introduces SF-Flow, a flow-matching framework for 3D acoustic transfer function (ATF) magnitude reconstruction from arbitrary sparse microphone arrays. It employs a 3D U-Net conditioned by a permutation-invariant set encoder to treat reconstruction as a guided generative task, claiming accurate results up to 1 kHz, substantially faster training than an autoencoder baseline, and clear performance gains with increasing dataset size.

Significance. If the experimental claims hold, the work would be significant for spatial audio and room acoustics applications by providing an efficient generative solution to an ill-posed inverse problem that naturally accommodates variable numbers of inputs. The use of flow matching for stable training and the set-encoder conditioning are clear technical strengths that address practical constraints in microphone array setups.

major comments (2)

[Abstract and §4] The abstract and §4 (experimental results) assert accurate reconstruction up to 1 kHz, faster training, and scaling with dataset size, yet supply no quantitative metrics (e.g., mean squared error, perceptual measures), error bars, dataset descriptions, microphone array configurations, or implementation details of the autoencoder baseline. This absence prevents verification of whether the data support the headline claims.
[§3 and §4] The central claim that the permutation-invariant set encoder enables reliable recovery from arbitrary sparse inputs without artifacts at higher frequencies or complex geometries is load-bearing, but the manuscript provides no ablation on encoder variants or failure cases beyond 1 kHz to substantiate robustness.

minor comments (2)

[§3] Notation for the conditioning mechanism (permutation-invariant set encoder) could be clarified with an explicit equation or diagram in §3 to show how variable-length inputs are aggregated before the 3D U-Net.
[§4] The manuscript would benefit from a table summarizing training times, reconstruction errors, and dataset sizes across methods to make the speed and scaling claims immediately comparable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and substantiation of our claims. We address each point below and will incorporate revisions to strengthen the paper.

read point-by-point responses

Referee: [Abstract and §4] The abstract and §4 (experimental results) assert accurate reconstruction up to 1 kHz, faster training, and scaling with dataset size, yet supply no quantitative metrics (e.g., mean squared error, perceptual measures), error bars, dataset descriptions, microphone array configurations, or implementation details of the autoencoder baseline. This absence prevents verification of whether the data support the headline claims.

Authors: We agree that the absence of explicit numerical metrics in the text of the abstract and §4 limits verifiability. While §4 presents visual comparisons in figures showing reconstruction quality, we acknowledge that tabulated values, error bars, dataset details, array configurations, and baseline implementation specifics are not provided in the prose. In the revised manuscript, we will add a summary table in §4 with mean squared error (MSE) and standard deviations across frequency bands, training time comparisons, scaling results with dataset size, descriptions of the simulated room datasets, microphone array setups (e.g., random sparse positions and counts), and autoencoder baseline details (architecture, hyperparameters, and training protocol). This will directly support the stated claims. revision: yes
Referee: [§3 and §4] The central claim that the permutation-invariant set encoder enables reliable recovery from arbitrary sparse inputs without artifacts at higher frequencies or complex geometries is load-bearing, but the manuscript provides no ablation on encoder variants or failure cases beyond 1 kHz to substantiate robustness.

Authors: The permutation-invariant set encoder is central to accommodating arbitrary sparse microphone inputs. Our experiments in §4 evaluate performance across varying numbers of inputs and configurations up to 1 kHz, with results indicating stable reconstruction without prominent artifacts in the tested cases. However, we agree that the lack of an explicit ablation on encoder variants (e.g., permutation-invariant vs. ordered or non-set alternatives) and analysis of failure modes at higher frequencies or complex geometries weakens the robustness claim. We will add an ablation study to the revised §4, including quantitative metrics comparing encoder variants and qualitative/quantitative examples of performance limits beyond 1 kHz and in more complex geometries. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript frames 3D ATF magnitude reconstruction as a guided generative task solved by flow matching on a 3D U-Net conditioned by a permutation-invariant set encoder. This architecture directly targets variable input cardinality and the ill-posed inverse problem without any load-bearing step that reduces, by the paper's own equations or self-citation, to a fitted parameter or prior result from the same authors. Experimental claims (accuracy to 1 kHz, faster training than autoencoder baseline, scaling with dataset size) are presented as empirical outcomes rather than derivations that are tautological by construction. No self-definitional loops, fitted-input predictions, or uniqueness theorems imported from overlapping prior work appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text would be required to audit these.

pith-pipeline@v0.9.0 · 5474 in / 1190 out tokens · 57583 ms · 2026-05-12T03:31:56.499306+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We propose a novel framework for 3D ATF magnitude reconstruction as a guided generation task, with a 3D U-Net conditioned by a permutation-invariant set encoder... Flow Matching (FM) learns to transform samples from a simple prior distribution... LOT-CFM loss
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
The dynamics of this evolution are governed by an Ordinary Differential Equation (ODE) defined by a time-dependent vector field u_t

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

[1]

INTRODUCTION Reconstruction of acoustic fields from sparse measurements is a cen- tral challenge in spatial audio. This is typically done in the domain of Room Impulse Responses (RIRs) or Acoustic Transfer Functions (ATFs), which are acoustical fingerprints of an environment and are essential for applications ranging from room acoustics analysis to immers...

work page
[2]

PROPOSED METHOD 2.1. Problem Statement The ATF, denoted byH(p src,p mic, f), is a complex-valued function that describes the acoustic response between a source positionp src arXiv:2605.10398v1 [eess.AS] 11 May 2026 and a microphone positionp mic at a specific frequencyf. Its mag- nitude,|H(·)|, captures modal resonances, spectral coloration, and frequency...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Experimental Setup We simulated RIRs using thepyroomacousticslibrary [36] for a room of dimensions4 m×6 m×3 m, with a reverberation time (T60) of0.2 s

EXPERIMENTS 3.1. Experimental Setup We simulated RIRs using thepyroomacousticslibrary [36] for a room of dimensions4 m×6 m×3 m, with a reverberation time (T60) of0.2 s. In each simulation, a sound source was placed at a random position. The ground-truth sound field was represented by a 3D ATF magnitude cube, sampled at 1331 microphone positions on a unifo...

work page
[4]

Expanding to R2 and R3, achieves substantial reductions in LSD, whilst maintaining faster and more efficient training. Crucially, nei- ther R2 (900 epochs), R3 (600 epochs), nor R3 Long (R3 trained for 1,000 epochs) had converged at their reported checkpoints, suggest- ing further gains are possible with continued training

work page
[5]

Our architecture using a 3D U-Net conditioned by a permutation-invariant set encoder enables reconstruction from an arbitrary number of measurements

CONCLUSION AND FUTURE WORK We proposed SF-Flow, a method for estimating 3D ATF magnitudes from spatially sparse measurements based on FM. Our architecture using a 3D U-Net conditioned by a permutation-invariant set encoder enables reconstruction from an arbitrary number of measurements. Experimental results demonstrated that SF-Flow achieves perfor- mance...

work page
[6]

Personal sound zones: Delivering interface-free audio to multiple listeners,

T. Betlehem et al., “Personal sound zones: Delivering interface-free audio to multiple listeners,”IEEE Signal Pro- cess. Mag., vol. 32, no. 2, pp. 81–91, 2015

work page 2015
[7]

Ueno and S

N. Ueno and S. Koyama,Sound Field Estimation: Theories and Applications (Foundations and Trends® in Signal Process- ing), vol. 19, Now Publishers, 2025

work page 2025
[8]

Kuttruff,Room Acoustics, Spon Press, 2000

H. Kuttruff,Room Acoustics, Spon Press, 2000

work page 2000
[9]

Room response equaliza- tion—a review,

S. Cecchi, A. Carini, and S. Spors, “Room response equaliza- tion—a review,”Applied Sciences, vol. 8, no. 1, 2018

work page 2018
[10]

Low-frequency optimization us- ing multiple subwoofers,

T. Welti and A. Devantier, “Low-frequency optimization us- ing multiple subwoofers,”Journal of The Audio Engineering Society, vol. 54, pp. 347–364, 2006

work page 2006
[11]

Sound field reconstruction in rooms: Inpainting meets super- resolution,

F. Llu ´ıs, P. Mart´ınez-Nuevo, M. B. Møller, and S. E. Shepstone, “Sound field reconstruction in rooms: Inpainting meets super- resolution,”J. Acoust. Soc. Amer., vol. 148, no. 2, 2020

work page 2020
[12]

Sound field reconstruction using neural processes with dynamic kernels,

Z. Liang, W. Zhang, and T. D. Abhayapala, “Sound field reconstruction using neural processes with dynamic kernels,” EURASIP J. Audio, Speech, Music Proc., vol. 13, 2024

work page 2024
[13]

Reconstruction of sound field through diffu- sion models,

F. Miotello et al., “Reconstruction of sound field through diffu- sion models,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2024, pp. 1476–1480

work page 2024
[14]

Learning magnitude distribution of sound fields via conditioned autoencoder,

S. Koyama and K. Ishizuka, “Learning magnitude distribution of sound fields via conditioned autoencoder,” inProc. Forum Acusticum, M´alaga, Jun. 2025

work page 2025
[15]

Generative data augmentation challenge: Synthe- sis of room acoustics for speaker distance estimation,

J. Lin et al., “Generative data augmentation challenge: Synthe- sis of room acoustics for speaker distance estimation,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. Workshops (ICASSPW), 2025, pp. 1–5

work page 2025
[16]

E. G. Williams,Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, 1999

work page 1999
[17]

Perceptual soundfield reconstruction in three dimensions via sound field extrapolation,

E. Erdem, E. De Sena, H. Hacıhabibo ˘glu, and Z. Cvetkovi ´c, “Perceptual soundfield reconstruction in three dimensions via sound field extrapolation,” inIEEE Int. Conf. on Acous., Speech, Signal Process.(ICASSP), 2019, pp. 8023–8027

work page 2019
[18]

Colton and R

D. Colton and R. Kress,Inverse Acoustic and Electromagnetic Scattering Theory, Springer, 2013

work page 2013
[19]

Room impulse response interpolation from a sparse set of measurements using a modal architecture,

O. Das, P. Calamia, and S. V . A. Gari, “Room impulse response interpolation from a sparse set of measurements using a modal architecture,” inIEEE Int. Conf. Acous., Speech Sig. Process. (ICASSP). IEEE, 2021, pp. 960–964

work page 2021
[20]

Sound field in- terpolation via sparse plane wave decomposition for 6dof im- mersive audio,

O. Olgun, E. Erdem, and H. Hacıhabibo ˘glu, “Sound field in- terpolation via sparse plane wave decomposition for 6dof im- mersive audio,” inImmersive and 3D Audio: from Architecture to Automotive (I3DA), 2023, pp. 1–10

work page 2023
[21]

Sparse representation of a spa- tial sound field in a reverberant environment,

S. Koyama and L. Daudet, “Sparse representation of a spa- tial sound field in a reverberant environment,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 172–184, 2019

work page 2019
[22]

Directionally weighted wave field estimation exploiting prior information on source direction,

N. Ueno, S. Koyama, and H. Saruwatari, “Directionally weighted wave field estimation exploiting prior information on source direction,”IEEE Trans. Signal Process., vol. 69, pp. 2383–2395, 2021

work page 2021
[23]

Physics-informed machine learning for sound field estimation: Fundamentals, state of the art, and chal- lenges,

S. Koyama et al., “Physics-informed machine learning for sound field estimation: Fundamentals, state of the art, and chal- lenges,”IEEE Signal Process. Mag., vol. 41, no. 6, pp. 60–71, 2025

work page 2025
[24]

Learning neural acoustic fields,

A. Luo et al., “Learning neural acoustic fields,” inInt. Conf. on Neural Information Processing Systems (NIPS), Red Hook, NY , USA, 2022, NIPS ’22, Curran Associates Inc

work page 2022
[25]

Sound field estimation based on physics-constrained kernel interpolation adapted to environ- ment,

J. G. C. Ribeiro et al., “Sound field estimation based on physics-constrained kernel interpolation adapted to environ- ment,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 32, pp. 4369–4383, 2024

work page 2024
[26]

Physics-informed neural network for volu- metric sound field reconstruction of speech signals,

M. Olivieri et al., “Physics-informed neural network for volu- metric sound field reconstruction of speech signals,”EURASIP J. Audio, Speech, Music Proc., vol. 42, 2024

work page 2024
[27]

Generative ad- versarial networks with physical sound field priors,

X. Karakonstantis and E. Fernandez-Grande, “Generative ad- versarial networks with physical sound field priors,”The Jour- nal of the Acoustical Society of America, vol. 154, no. 2, pp. 1226–1238, 08 2023

work page 2023
[28]

Fast-rir: Fast neural diffuse room impulse response generator,

A. Ratnarajah et al., “Fast-rir: Fast neural diffuse room impulse response generator,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2022, pp. 571–575

work page 2022
[29]

Diffu- sionrir: Room impulse response interpolation using diffusion models,

S. D. Torre, M. Pezzoli, F. Antonacci, and S. Gannot, “Diffu- sionrir: Room impulse response interpolation using diffusion models,” inProc. Forum Acusticum, M ´alaga, Jun. 2025

work page 2025
[30]

Solving audio inverse problems with a diffusion model,

E. Moliner Juanpere, J. Lehtinen, and V . V ¨alim¨aki, “Solving audio inverse problems with a diffusion model,” in2023 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), United States, June 2023, pp. 1–5, IEEE

work page 2023
[31]

Gencho: Room impulse response generation from reverberant speech and text via diffusion transformers,

J. Lin, J. Su, N. Anand, Z. Jin, M. Kim, and P. Smaragdis, “Gencho: Room impulse response generation from reverberant speech and text via diffusion transformers,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2026

work page 2026
[32]

Flow matching for generative modeling,

Y . Lipman et al., “Flow matching for generative modeling,” in Int. Conf. on Learning Representations (ICLR), 2023

work page 2023
[33]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,”ArXiv, vol. abs/2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Few-shot acoustic synthesis with multimodal flow matching,

A. Brunetto, “Few-shot acoustic synthesis with multimodal flow matching,” inCVPR, 2026

work page 2026
[35]

Solving room impulse response inverse problems using flow matching with analytic wiener denoiser,

K. Y . Lee, N. Meyer-Kahlen, V . V¨alim¨aki, and S. J. Schlecht, “Solving room impulse response inverse problems using flow matching with analytic wiener denoiser,”arXiv preprint arXiv:2602.00652, 2026

work page arXiv 2026
[36]

Room impulse response generation conditioned on acoustic parame- ters,

S. Arellano, C. Yeh, G. Bhattacharya, and D. Arteaga, “Room impulse response generation conditioned on acoustic parame- ters,” inProc. IEEE Int. Workshop Appl. Signal Process. Audio Acoust. (WASPAA), July 2025

work page 2025
[37]

Multimodal room impulse response generation through latent rectified flow matching,

A. V osoughi, Y . Zang, Q. Yang, N. Paek, R. Leistikow, and C. Xu, “Multimodal room impulse response generation through latent rectified flow matching,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2026

work page 2026
[38]

Introduction to flow matching and diffusion models,

P. Holderrieth and E. Erives, “Introduction to flow matching and diffusion models,” MIT Lecture Notes, 2026

work page 2026
[39]

Flow Matching Guide and Code

Y . Lipman et al., “Flow matching guide and code,”arXiv preprint arXiv:2412.06264, 2024

work page internal anchor Pith review arXiv 2024
[40]

Film: Visual reasoning with a general condi- tioning layer,

E. Perez et al., “Film: Visual reasoning with a general condi- tioning layer,” inAAAI, 2018

work page 2018
[41]

Pyroomacoustics: A python package for audio room simulation and array pro- cessing algorithms,

R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A python package for audio room simulation and array pro- cessing algorithms,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, p. 351–355

work page 2018