MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

Dohwan Kim; Jung-Woo Choi

arxiv: 2606.09677 · v2 · pith:OOPQ25HWnew · submitted 2026-06-08 · 📡 eess.AS · cs.AI

Pith reviewed 2026-06-27 14:50 UTC · model grok-4.3

keywords: multi-channel speech separation · generative corrector · MeanFlow · one-step generation · Data-Space Optimization · perceptual quality · speech enhancement

The pith

MeCo maps any discriminative multi-channel speech separation estimate onto the clean speech manifold in one MeanFlow step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Discriminative models for multi-channel speech separation deliver strong reference metrics yet often fall short on human listening quality. MeCo corrects this by learning a conditional average velocity field that performs the mapping from estimate to clean speech in a single generative step. Data-Space Optimization trains this field with an x_r-loss on longer displacement intervals together with an Endpoint SI-SDR loss to balance perceptual quality and terminal fidelity. The result is claimed to reach state-of-the-art performance at negligible extra cost in both matched and mismatched conditions.

Core claim

MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. Data-Space Optimization integrates an x_r-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity.
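
For reference, the scale-invariant signal-to-distortion ratio behind the Endpoint SI-SDR loss can be sketched as follows. This is a minimal NumPy illustration of the standard SI-SDR definition (Le Roux et al., "SDR – half-baked or well done?"), negated so that it can be minimized as a loss. The names `si_sdr` and `endpoint_si_sdr_loss`, the `eps` stabilizer, and the single-channel 1-D inputs are assumptions made for this sketch; the paper's exact loss formulation and weighting are not shown in the abstract.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length."""
    # Remove the mean so a DC offset does not influence the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def endpoint_si_sdr_loss(estimate, target):
    """Hypothetical loss wrapper: higher SI-SDR means lower loss."""
    return -si_sdr(estimate, target)


# Toy signals (assumed data, for illustration only).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = target + 0.1 * noise   # lightly corrupted estimate
heavy = target + 1.0 * noise  # heavily corrupted estimate

print(si_sdr(mild, target), si_sdr(heavy, target))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.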

What carries the argument

The MeanFlow conditional average velocity field, which performs the direct one-step mapping from discriminative estimate to clean speech.
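
The one-step mapping can be illustrated with a toy average velocity field. The sketch below assumes the time orientation used in the flow matching excerpt (0 at the starting point, 1 at clean data) and applies the average-velocity displacement relation x_t = x_r + (t - r) · u(x_r, r, t). The names `one_step_correct` and `toy_avg_velocity` are hypothetical, and the toy field is an oracle that already knows the clean signal. In MeCo the field would instead be a learned, conditioned network whose architecture and exact starting point are not described here.

```python
import numpy as np


def one_step_correct(x_start, avg_velocity, r=0.0, t=1.0):
    """Jump from time r to time t using an average velocity field u(x, r, t)."""
    return x_start + (t - r) * avg_velocity(x_start, r, t)


# Toy data (assumed): a "clean" vector and a perturbed "discriminative estimate".
rng = np.random.default_rng(0)
x_clean = rng.standard_normal(8)
x_est = x_clean + 0.3 * rng.standard_normal(8)


def toy_avg_velocity(x, r, t):
    # Oracle field for straight-line paths that end at x_clean at time 1.
    # Along such a path the velocity is constant, so the average over [r, t]
    # equals the instantaneous value (x_clean - x) / (1 - r).
    return (x_clean - x) / (1.0 - r)


# One step from r=0 to t=1.
x_one_step = one_step_correct(x_est, toy_avg_velocity)

# Two half-steps reach the same endpoint when the field is exact.
x_half = one_step_correct(x_est, toy_avg_velocity, r=0.0, t=0.5)
x_two_step = one_step_correct(x_half, toy_avg_velocity, r=0.5, t=1.0)

print(np.allclose(x_one_step, x_clean), np.allclose(x_two_step, x_clean))
```

With an exact field, one step and two steps agree. The load-bearing premise discussed below is whether a learned field stays accurate enough for the single large step to land close to clean speech.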

If this is right

State-of-the-art signal fidelity is achieved with only minimal added computation.
Human listening quality improves simultaneously with reference metrics.
The gains hold for both in-domain and out-of-domain test conditions.
One-step generation replaces multi-step sampling while retaining generative benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-step design could support lower-latency real-time speech separation systems.
Data-Space Optimization may transfer to other audio tasks where perceptual quality must be balanced against reference metrics.
MeanFlow velocity fields might serve as lightweight correctors for other discriminative audio models beyond separation.

Load-bearing premise

A single step of the learned conditional average velocity field is sufficient to map any discriminative estimate directly onto the clean speech manifold.

What would settle it

A controlled listening test in which MeCo outputs receive no higher perceptual ratings than the uncorrected outputs of the underlying discriminative separator.

read the original abstract

While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

MeCo adds a one-step MeanFlow corrector and DSO loss combo to fix listening quality gaps in multi-channel separation, but the abstract supplies zero experimental detail so the SOTA and single-step claims stay uncheckable.

The key takeaway is that MeCo introduces a one-step generative corrector based on MeanFlow to refine discriminative multi-channel speech separation outputs for better listening quality, using a new DSO procedure that blends x_r-loss and endpoint SI-SDR.

The novelty lies in applying the MeanFlow average velocity field specifically as a one-step corrector, with that loss combination, to the separation task. The paper rightly identifies the practical problem that good SI-SDR scores do not always translate into good perceptual quality, and it proposes a low-overhead fix.

The weaknesses are more than minor. The description available here is abstract-level only: no equations, no experimental setup, no results tables or comparisons. That makes it impossible to tell whether the SOTA performance in the in-domain and out-of-domain cases is real, or whether the one-step mapping actually lands on the clean speech manifold without artifacts. The stress-test concern about distant out-of-domain estimates is reasonable, given that no guarantee or validation is offered for the single-step trajectory.

This paper is aimed at speech processing researchers who work on enhancement and separation for real-world applications such as consumer devices. A reader interested in hybrid discriminative-generative models might also take ideas from it. It deserves a serious referee because the topic has clear applied value and the method is described coherently enough that the details are worth checking.

I would recommend engaging with the work by sending it for peer review to get the full manuscript evaluated.

Referee Report

2 major / 0 minor

Summary. The paper proposes MeCo, a MeanFlow-based one-step generative corrector for multi-channel speech separation. It learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. Data-Space Optimization (DSO) is introduced, combining an x_r-loss (penalizing errors on longer displacement intervals) with an Endpoint SI-SDR loss to optimize for human listening quality alongside signal fidelity. Experiments claim SOTA performance with minimal overhead, superior fidelity and listening quality in both in-domain and out-of-domain scenarios.

Significance. If the one-step correction holds, MeCo would offer an efficient post-processing layer that improves perceptual quality of existing discriminative separators without substantial compute, addressing a known gap between reference metrics and human listening in multi-channel separation.

major comments (2)

[Abstract] The central claim that a single Euler integration of the learned conditional average velocity field suffices to reach the clean-speech manifold from any discriminative estimate (including distant out-of-domain cases) lacks supporting derivation or guarantee; the construction of DSO, x_r-loss, and Endpoint SI-SDR does not by itself ensure the learned field remains accurate far from the data manifold or that one step avoids audible artifacts.
[Abstract] The assertion of simultaneous SOTA signal fidelity and human listening quality in out-of-domain scenarios rests on the unverified premise that the one-step trajectory lands inside the manifold; no independent check (e.g., manifold-distance metric or artifact analysis) is described to confirm this for estimates lying far from training data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions.

Referee: [Abstract] The central claim that a single Euler integration of the learned conditional average velocity field suffices to reach the clean-speech manifold from any discriminative estimate (including distant out-of-domain cases) lacks supporting derivation or guarantee; the construction of DSO, x_r-loss, and Endpoint SI-SDR does not by itself ensure the learned field remains accurate far from the data manifold or that one step avoids audible artifacts.

Authors: We agree that no formal derivation or theoretical guarantee is provided for one-step convergence to the manifold, particularly for distant out-of-domain estimates. DSO is an empirical training strategy. In revision we will soften the abstract language to emphasize the empirical nature of the claim and add a short discussion subsection on the one-step assumption, supported by additional out-of-domain artifact analysis.

Revision: yes

Referee: [Abstract] The assertion of simultaneous SOTA signal fidelity and human listening quality in out-of-domain scenarios rests on the unverified premise that the one-step trajectory lands inside the manifold; no independent check (e.g., manifold-distance metric or artifact analysis) is described to confirm this for estimates lying far from training data.

Authors: The manuscript currently relies on SI-SDR and listening-quality metrics as proxies. We acknowledge the lack of an explicit manifold-distance metric or dedicated artifact analysis. We will add a new analysis subsection containing qualitative artifact examples and a simple embedding-based distance check for out-of-domain cases.

Revision: yes
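
One possible shape for the embedding-based distance check mentioned in the rebuttal is sketched below. It is only an illustration under stated assumptions: the embeddings are random stand-ins for the outputs of some speech encoder (none is named in the source), and `mean_nn_distance` is a hypothetical helper that averages each query embedding's Euclidean distance to its nearest clean-speech embedding. The rebuttal does not say which distance or encoder the authors would use.

```python
import numpy as np


def mean_nn_distance(query, reference):
    """Mean Euclidean distance from each query row to its nearest reference row.

    query has shape (N, D); reference has shape (M, D).
    """
    diffs = query[:, None, :] - reference[None, :, :]   # (N, M, D)
    dists = np.linalg.norm(diffs, axis=-1)              # (N, M)
    return dists.min(axis=1).mean()


# Stand-in embeddings (assumed data): a clean-speech reference set, a set of
# outputs close to it, and a set shifted far away from it.
rng = np.random.default_rng(0)
clean_emb = rng.standard_normal((200, 16))
near_emb = clean_emb[:20] + 0.05 * rng.standard_normal((20, 16))
far_emb = clean_emb[:20] + 5.0

d_near = mean_nn_distance(near_emb, clean_emb)
d_far = mean_nn_distance(far_emb, clean_emb)
print(d_near, d_far)
```

A lower value suggests the outputs sit closer to the reference set in that embedding space. Such a check would remain a proxy, since it depends on the chosen encoder and cannot by itself confirm the absence of audible artifacts.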

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The provided abstract and context describe MeCo as learning a conditional average velocity field from data to perform a one-step mapping, optimized via the introduced DSO combining x_r-loss and Endpoint SI-SDR loss. No equations, self-citations, or load-bearing steps are shown that reduce a claimed prediction or result to its own inputs by construction. The method is presented as data-driven empirical learning rather than self-definitional or fitted-input renaming, making the derivation independent of the target claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The central claim rests on unstated assumptions about the existence and learnability of the mean velocity field and the effectiveness of DSO for perceptual quality.

Reference graph

Works this paper leans on

45 extracted references · 2 canonical work pages · 2 internal anchors

[1]

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

Introduction. Deep discriminative models have significantly advanced multi-channel speech enhancement and separation. Modern architectures [1–4], readily adaptable across joint denoising, dereverberation, and speech separation, have achieved saturated performance on reference-based metrics. However, these models are primarily trained to optimize obj...

arXiv 2026
[2]

Flow Matching (FM) [11] is a generative framework that learns to construct a flow path between a simple prior distribution p_0 and a complex data distribution p_1

Background. 2.1. Flow Matching. Flow Matching (FM) [11] is a generative framework that learns to construct a flow path between a simple prior distribution p_0 and a complex data distribution p_1. Formally, given a prior sample x_0 ∼ p_0 and a data sample x_1 ∼ p_1, a state x_t along the flow path at time t ∈ [0, 1] can be explicitly constructed using predefined schedu...
[3]

MeCo incorporates a conditional MeanFlow-based architecture (Section 3.1) and DSO to maximize one-step generation performance (Section 3.2)

Method. We introduce MeCo, a one-step generative corrector for multi-channel speech separation. MeCo incorporates a conditional MeanFlow-based architecture (Section 3.1) and DSO to maximize one-step generation performance (Section 3.2). 3.1. Conditional MeanFlow-based correction. The proposed corrector operates in the complex Short-Time Fourier Transform...
[4]

Datasets. To evaluate the proposed MeCo, we constructed multi-channel noisy and reverberant datasets

Experiments. 4.1. Datasets. To evaluate the proposed MeCo, we constructed multi-channel noisy and reverberant datasets. For the in-domain training and test sets, we used clean speech from the WSJ0 corpus mixed with noise from WHAM! [30]. To assess the model’s generalization capabilities, we constructed two separate out-of-domain evaluation sets. The first...
[5]

By leveraging Mean Flows, MeCo effectively maps discriminative estimates directly onto the clean speech manifold in a single step

Conclusion. We proposed MeCo, the first one-step generative corrector for multi-channel speech separation. By leveraging Mean Flows, MeCo effectively maps discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduced DSO, which incorporates an x_r-loss and an Endpoint SI-SDR ...
[6]

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant (No. RS-2024-00337945), STEAM research grant (No. RS-2024-00464269) funded by the Ministry of Science and ICT of Korea government (MSIT), and the BK21 FOUR program through the NRF grant funded by the Ministry of Education of Korea government (MOE).
[7]

Generative AI Use Disclosure. Generative AI tools were used to edit and polish the manuscript, improving readability and refining the experimental code.
[8]

TF-GridNet: Integrating full- and sub-band modeling for speech separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating full- and sub-band modeling for speech separation,” TASLP, vol. 31, pp. 3221–3236, 2023

2023
[9]

SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,

C. Quan and X. Li, “SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,” TASLP, vol. 32, pp. 1310–1323, 2024

2024
[10]

TF-CrossNet: Leveraging global, cross-band, narrow-band, and positional encoding for single- and multi-channel speaker separation,

V. A. Kalkhorani and D. Wang, “TF-CrossNet: Leveraging global, cross-band, narrow-band, and positional encoding for single- and multi-channel speaker separation,” TASLP, vol. 32, pp. 4999–5009, 2024

2024
[11]

DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing,

D. Lee and J.-W. Choi, “DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing,” TASLP, vol. 32, pp. 4850–4866, 2024

2024
[12]

SDR – half-baked or well done?

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in Proc. ICASSP, 2019

2019
[13]

Universal speech enhancement with score-based diffusion,

J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal speech enhancement with score-based diffusion,” in Proc. ICLR, 2023

2023
[14]

Speech enhancement and dereverberation with diffusion-based generative models,

J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” TASLP, vol. 31, pp. 2351–2364, 2023

2023
[15]

DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, 2022

2022
[16]

UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022

2022
[17]

Score-based generative modeling through stochastic differential equations,

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in Proc. ICLR, 2021

2021
[18]

Flow matching for generative modeling,

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in Proc. ICLR, 2023

2023
[19]

Conditional diffusion probabilistic model for speech enhancement,

Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in Proc. ICASSP, 2022

2022
[20]

StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,

J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” TASLP, vol. 31, pp. 2724–2737, 2023

2023
[21]

Diffusion-based generative speech source separation,

R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” in Proc. ICASSP, 2023

2023
[22]

Generative pre-training for speech with flow matching,

A. H. Liu, M. Le, A. Vyas, B. Shi, A. Tjandra, and W.-N. Hsu, “Generative pre-training for speech with flow matching,” in Proc. ICLR, 2024

2024
[23]

EDSep: An effective diffusion-based method for speech source separation,

J. Dong, X. Wang, and Q. Mao, “EDSep: An effective diffusion-based method for speech source separation,” in Proc. ICASSP, 2025

2025
[24]

Source separation by flow matching,

R. Scheibler, J. R. Hershey, A. Doucet, and H. Li, “Source separation by flow matching,” in Proc. WASPAA, 2025

2025
[25]

DiffCBF: A diffusion model with convolutional beamformer for joint speech separation, denoising, and dereverberation,

R. Kimura, T. Ueda, T. Nakatani, N. Kamo, M. Delcroix, S. Araki, and S. Makino, “DiffCBF: A diffusion model with convolutional beamformer for joint speech separation, denoising, and dereverberation,” in Proc. EUSIPCO, 2025

2025
[26]

ArrayDPS: Unsupervised blind speech separation with a diffusion prior,

Z. Xu, X. Fan, Z.-Q. Wang, X. Jiang, and R. R. Choudhury, “ArrayDPS: Unsupervised blind speech separation with a diffusion prior,” in Proc. ICML, 2025

2025
[27]

Diffiner: A versatile diffusion-based generative refiner for speech enhancement,

R. Sawata, N. Murata, Y. Takida, T. Uesaka, T. Shibuya, S. Takahashi, and Y. Mitsufuji, “Diffiner: A versatile diffusion-based generative refiner for speech enhancement,” in Proc. Interspeech, 2023

2023
[28]

Separate and diffuse: Using a pretrained diffusion model for improving source separation,

S. Lutati, E. Nachmani, and L. Wolf, “Separate and diffuse: Using a pretrained diffusion model for improving source separation,” in Proc. ICLR, 2024

2024
[29]

Noise-robust speech separation with fast generative correction,

H. Wang, J. Villalba, L. Moro-Velazquez, J. Hai, T. Thebaud, and N. Dehak, “Noise-robust speech separation with fast generative correction,” in Proc. Interspeech, 2024

2024
[30]

SpeechRefiner: Towards perceptual quality refinement for front-end algorithms,

S. Li, S. Wang, Z. Liu, Z. Jiang, Y. Wang, and H. Li, “SpeechRefiner: Towards perceptual quality refinement for front-end algorithms,” in Proc. Interspeech, 2025

2025
[31]

Mean flows for one-step generative modeling,

Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He, “Mean flows for one-step generative modeling,” in Proc. NeurIPS, 2025

2025
[32]

Back to Basics: Let Denoising Generative Models Denoise

T. Li and K. He, “Back to basics: Let denoising generative models denoise,” arXiv preprint arXiv:2511.13720, 2025

arXiv 2025
[33]

MeanFlowSE: one-step generative speech enhancement via conditional mean flow,

D. Li, S. Lu, H. Pan, Z. Zhan, Q. Hong, and L. Li, “MeanFlowSE: one-step generative speech enhancement via conditional mean flow,” in Proc. ICASSP, 2026

2026
[34]

MeanSE: Efficient generative speech enhancement with mean flows,

J. Wang, H. Wang, W. Wang, L. Yang, C. Li, W. Zhang, L. Tan, and Y. Qian, “MeanSE: Efficient generative speech enhancement with mean flows,” in Proc. ICASSP, 2026

2026
[35]

FlowSE: Flow matching-based speech enhancement,

S. Lee, S. Cheong, S. Han, and J. W. Shin, “FlowSE: Flow matching-based speech enhancement,” in Proc. ICASSP, 2025

2025
[36]

A step-by-step process for building TTS voices using open source data and frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese,

K. Sodimana, P. De Silva, S. Sarin, O. Kjartansson, M. Jansche, K. Pipatsrisawat, and L. Ha, “A step-by-step process for building TTS voices using open source data and frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese,” in Proc. SLTU, 2018

2018
[37]

WHAM!: Extending speech separation to noisy environments,

G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending speech separation to noisy environments,” in Proc. Interspeech, 2019

2019
[38]

Librispeech: An ASR corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015

2015
[39]

The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,

J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in Proc. of Meetings on Acoustics, 2013

2013
[40]

gpuRIR: A Python library for room impulse response simulation with GPU acceleration,

D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpuRIR: A Python library for room impulse response simulation with GPU acceleration,” Multimedia Tools and Applications, vol. 80, pp. 5653–5671, 2021

2021
[41]

Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015

2015
[42]

SA-SDR: A novel loss function for separation of meeting style data,

T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “SA-SDR: A novel loss function for separation of meeting style data,” in Proc. ICASSP, 2022

2022
[43]

Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001

2001
[44]

An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,

J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” TASLP, vol. 24, no. 11, pp. 2009–2022, 2016

2016
[45]

NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,

G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021

2021

[1] [1]

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

Introduction Deep discriminative models have significantly advanced multi- channel speech enhancement and separation. Modern architec- tures [1–4], readily adaptable across joint denoising, derever- beration, and speech separation, have achieved saturated per- formance on reference-based metrics. However, these models are primarily trained to optimize obj...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Flow Matching Flow Matching (FM) [11] is a generative framework that learns to construct a flow path between a simple prior distributionp 0 and a complex data distributionp1

Background 2.1. Flow Matching Flow Matching (FM) [11] is a generative framework that learns to construct a flow path between a simple prior distributionp 0 and a complex data distributionp1. Formally, given a prior sam- plex 0 ∼p 0 and a data samplex 1 ∼p 1, a statex t along the flow path at timet∈[0,1]can be explicitly constructed using predefined schedu...

[3] [3]

MeCo incorporates a conditional MeanFlow-based architecture (Section 3.1) and DSO to maxi- mize one-step generation performance (Section 3.2)

Method We introduce MeCo, a one-step generative corrector for multi- channel speech separation. MeCo incorporates a conditional MeanFlow-based architecture (Section 3.1) and DSO to maxi- mize one-step generation performance (Section 3.2). 3.1. Conditional MeanFlow-based correction The proposed corrector operates in the complex Short-Time Fourier Transform...

[4] [4]

Datasets To evaluate the proposed MeCo, we constructed multi-channel noisy and reverberant datasets

Experiments 4.1. Datasets To evaluate the proposed MeCo, we constructed multi-channel noisy and reverberant datasets. For the in-domain training and test sets, we used clean speech from the WSJ0 corpus mixed with noise from WHAM! [30]. To assess the model’s general- ization capabilities, we constructed two separate out-of-domain evaluation sets. The first...

[5] [5]

By leveraging Mean Flows, MeCo effectively maps discriminative estimates directly onto the clean speech manifold in a single step

Conclusion We proposed MeCo, the first one-step generative corrector for multi-channel speech separation. By leveraging Mean Flows, MeCo effectively maps discriminative estimates directly onto the clean speech manifold in a single step. To maximize one- step generation performance, we introduced DSO, which incor- porates anx r-loss and an Endpoint SI-SDR ...

[6] [6]

RS-2024-00337945), STEAM re- search grant (No

Acknowledgements This work was supported by the National Research Foundation of Korea (NRF) grant (No. RS-2024-00337945), STEAM re- search grant (No. RS-2024-00464269) funded by the Ministry of Science and ICT of Korea government (MSIT), and the BK21 FOUR program through the NRF grant funded by the Ministry of Education of Korea government (MOE)

2024

[7] [7]

Generative AI Use Disclosure Generative AI tools were used to edit and polish the manuscript, improving readability and refining the experimental code

[8] [8]

TF-GridNet: Integrating full-and sub-band modeling for speech separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watan- abe, “TF-GridNet: Integrating full-and sub-band modeling for speech separation,”TASLP, vol. 31, pp. 3221–3236, 2023

2023

[9] [9]

SpatialNet: Extensively learning spatial in- formation for multichannel joint speech separation, denoising and dereverberation,

C. Quan and X. Li, “SpatialNet: Extensively learning spatial in- formation for multichannel joint speech separation, denoising and dereverberation,”TASLP, vol. 32, pp. 1310–1323, 2024

2024

[10] [10]

TF-CrossNet: Leveraging global, cross-band, narrow-band, and positional encoding for single-and multi-channel speaker separation,

V . A. Kalkhorani and D. Wang, “TF-CrossNet: Leveraging global, cross-band, narrow-band, and positional encoding for single-and multi-channel speaker separation,”TASLP, vol. 32, pp. 4999– 5009, 2024

2024

[11] [11]

DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing,

D. Lee and J.-W. Choi, “DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing,”TASLP, vol. 32, p. 4850–4866, 2024

2024

[12] [12]

SDR– half-baked or well done?

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR– half-baked or well done?” inProc. ICASSP, 2019

2019

[13] [13]

Universal speech enhancement with score-based diffusion,

J. Serr `a, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal speech enhancement with score-based diffusion,” inProc. ICLR, 2023

2023

[14] [14]

Speech enhancement and dereverberation with diffusion-based generative models,

J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,”TASLP, vol. 31, p. 2351–2364, 2023

2023

[15] [15]

DNSMOS P. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “DNSMOS P. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,” inProc. ICASSP, 2022

2022

[16] [16]

Utmos: Utokyo-sarulab system for voicemos challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “Utmos: Utokyo-sarulab system for voicemos challenge 2022,” inInterspeech, 2022

2022

[17] [17]

Score-based generative modeling through stochas- tic differential equations,

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochas- tic differential equations,” inProc. ICLR, 2021

2021

[18] [18]

Flow matching for generative modeling,

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” inProc. ICLR, 2023

2023

[19] [19]

Conditional diffusion probabilistic model for speech en- hancement,

Y .-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y . Tsao, “Conditional diffusion probabilistic model for speech en- hancement,” inProc. ICASSP, 2022

2022

[20] [20]

StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,

J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,”TASLP, vol. 31, pp. 2724–2737, 2023

2023

[21] [21]

Diffusion-based generative speech source separation,

R. Scheibler, Y . Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” in Proc. ICASSP, 2023

2023

[22] [22]

Generative pre-training for speech with flow matching,

A. H. Liu, M. Le, A. Vyas, B. Shi, A. Tjandra, and W.-N. Hsu, “Generative pre-training for speech with flow matching,” inProc. ICLR, 2024

2024

[23] [23]

EDSep: An effective diffusion- based method for speech source separation,

J. Dong, X. Wang, and Q. Mao, “EDSep: An effective diffusion- based method for speech source separation,” inProc. ICASSP, 2025

2025

[24] [24]

Source sepa- ration by flow matching,

R. Scheibler, J. R. Hershey, A. Doucet, and H. Li, “Source sepa- ration by flow matching,” inProc. WASPAA, 2025

2025

[25] [25]

DiffCBF: A diffusion model with convolutional beamformer for joint speech separation, denoising, and derever- beration,

R. Kimura, T. Ueda, T. Nakatani, N. Kamo, M. Delcroix, S. Araki, and S. Makino, “DiffCBF: A diffusion model with convolutional beamformer for joint speech separation, denoising, and derever- beration,” inProc. EUSIPCO, 2025

2025

[26] [26]

Ar- raydps: Unsupervised blind speech separation with a diffusion prior,

Z. Xu, X. Fan, Z.-Q. Wang, X. Jiang, and R. R. Choudhury, “Ar- raydps: Unsupervised blind speech separation with a diffusion prior,” inProc. ICML, 2025

2025

[27] [27]

Diffiner: A versatile diffusion-based generative refiner for speech enhancement,

R. Sawata, N. Murata, Y . Takida, T. Uesaka, T. Shibuya, S. Taka- hashi, and Y . Mitsufuji, “Diffiner: A versatile diffusion-based generative refiner for speech enhancement,” inProc. Interspeech, 2023

2023

[28] [28]

Separate and diffuse: Using a pretrained diffusion model for improving source separation,

S. Lutati, E. Nachmani, and L. Wolf, “Separate and diffuse: Using a pretrained diffusion model for improving source separation,” in Proc. ICLR, 2024

2024

[29] [29]

Noise-robust speech separation with fast generative correction,

H. Wang, J. Villalba, L. Moro-Velazquez, J. Hai, T. Thebaud, and N. Dehak, “Noise-robust speech separation with fast generative correction,” in Proc. Interspeech, 2024

2024

[30] [30]

SpeechRefiner: Towards perceptual quality refinement for front-end algorithms,

S. Li, S. Wang, Z. Liu, Z. Jiang, Y. Wang, and H. Li, “SpeechRefiner: Towards perceptual quality refinement for front-end algorithms,” in Proc. Interspeech, 2025

2025

[31] [31]

Mean flows for one-step generative modeling,

Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He, “Mean flows for one-step generative modeling,” in Proc. NeurIPS, 2025

2025

[32] [32]

Back to Basics: Let Denoising Generative Models Denoise

T. Li and K. He, “Back to basics: Let denoising generative models denoise,” arXiv preprint arXiv:2511.13720, 2025

2025

[33] [33]

MeanFlowSE: one-step generative speech enhancement via conditional mean flow,

D. Li, S. Lu, H. Pan, Z. Zhan, Q. Hong, and L. Li, “MeanFlowSE: one-step generative speech enhancement via conditional mean flow,” in Proc. ICASSP, 2026

2026

[34] [34]

MeanSE: Efficient generative speech enhancement with mean flows,

J. Wang, H. Wang, W. Wang, L. Yang, C. Li, W. Zhang, L. Tan, and Y. Qian, “MeanSE: Efficient generative speech enhancement with mean flows,” in Proc. ICASSP, 2026

2026

[35] [35]

Flowse: Flow matching-based speech enhancement,

S. Lee, S. Cheong, S. Han, and J. W. Shin, “Flowse: Flow matching-based speech enhancement,” in Proc. ICASSP, 2025

2025

[36] [36]

A step-by-step process for building tts voices using open source data and frameworks for bangla, javanese, khmer, nepali, sinhala, and sundanese

K. Sodimana, P. De Silva, S. Sarin, O. Kjartansson, M. Jansche, K. Pipatsrisawat, and L. Ha, “A step-by-step process for building tts voices using open source data and frameworks for bangla, javanese, khmer, nepali, sinhala, and sundanese,” in Proc. SLTU, 2018

2018

[37] [37]

WHAM!: Extending speech separation to noisy environments,

G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “WHAM!: Extending speech separation to noisy environments,” in Proc. Interspeech, 2019

2019

[38] [38]

Librispeech: an asr corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP, 2015

2015

[39] [39]

The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,

J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in Proc. of Meetings on Acoustics, 2013

2013

[40] [40]

gpuRIR: A python library for room impulse response simulation with gpu acceleration,

D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpuRIR: A python library for room impulse response simulation with gpu acceleration,” Multimedia Tools and Applications, vol. 80, pp. 5653–5671, 2021

2021

[41] [41]

Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015

2015

[42] [42]

SA-SDR: A novel loss function for separation of meeting style data,

T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “SA-SDR: A novel loss function for separation of meeting style data,” in Proc. ICASSP, 2022

2022

[43] [43]

Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001

2001

[44] [44]

An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,

J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” TASLP, vol. 24, no. 11, pp. 2009–2022, 2016

2016

[45] [45]

NISQA: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,

G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021

2021