Cross-Talk Speech Reduction, by Separation, for Separation

Samuele Cornell; Zhong-Qiu Wang

arxiv: 2605.19695 · v1 · pith:BWI6AXPAnew · submitted 2026-05-19 · 📡 eess.AS · cs.SD

Cross-Talk Speech Reduction, by Separation, for Separation

Zhong-Qiu Wang , Samuele Cornell This is my paper

Pith reviewed 2026-05-20 01:50 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords speech separationcross-talk reductionpseudo-labelsfar-field audioCHiME-6conversational ASRreal data trainingneural networks

0 comments

The pith

Cross-talk reduction on real close-talk recordings produces pseudo-labels that train far-field separation models to new state-of-the-art ASR levels on CHiME-6.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that cross-talk contamination in close-talk microphone signals can be removed by a neural model trained directly on real pairs of close-talk and far-field mixtures. These cleaned signals then act as pseudo-labels for training separation models that operate on the far-field recordings. A sympathetic reader would care because conventional training relies on simulated data that fails to match real room acoustics, speaker movement, and noise. By using only target-domain recordings the method closes that gap and improves downstream automatic speech recognition. The result is the first neural separation system shown to substantially beat guided source separation on genuine conversational speech-in-the-wild.

Core claim

The central claim is that a network called CTRnet, trained end-to-end on real-recorded close-talk and far-field mixture pairs, isolates each speaker's voice from cross-talk interference; the resulting estimates serve as effective pseudo-labels for a second stage, pseudo-label based far-field speech separation, that achieves state-of-the-art ASR word error rates on the CHiME-6 dataset under both oracle and estimated diarization while surpassing all prior CHiME-7 and CHiME-8 submissions.

What carries the argument

CTRnet, a neural separation model trained on real close-talk/far-field pairs to isolate the wearer's speech from cross-talk, whose outputs supply the pseudo-labels for the PuLSS far-field training stage.

If this is right

Both CTRnet and the downstream far-field model can be trained entirely on real target-domain recordings without simulation.
The framework delivers state-of-the-art ASR under oracle and estimated speaker diarization on CHiME-6.
It is the first neural separation approach shown to substantially outperform guided source separation on real conversational data.
Close-talk mixtures, previously too noisy for direct use, become usable weak supervision after cross-talk reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same real-pair training pattern could be applied to other microphone arrays where partial close-talk signals exist.
Integrating CTRnet-style reduction inside an end-to-end diarization-plus-separation pipeline might further reduce error propagation.
The pseudo-label strategy suggests a general route for adapting separation models to new acoustic environments using only the deployment hardware.

Load-bearing premise

The speech estimates produced by CTRnet on real close-talk mixtures must be clean enough to improve far-field model training rather than inject harmful label noise.

What would settle it

If ASR word error rate on the CHiME-6 evaluation set rises or stays flat when far-field models are retrained with CTRnet pseudo-labels instead of guided source separation labels, the central claim is false.

Figures

Figures reproduced from arXiv: 2605.19695 by Samuele Cornell, Zhong-Qiu Wang.

**Figure 2.** Figure 2: System overview. (a) Training Stage: CTRnet is trained in a semisupervised manner on real-recorded pairs of close-talk and far-field mixtures to estimate close-talk speech (see Section IV-D). The estimate is then used as pseudo-labels for training PuLSS in a supervised fashion on real-recorded farfield mixtures (see Section V-D). In PuLSS, oracle speaker-activity timestamps are used in input features to … view at source ↗

**Figure 3.** Figure 3: Illustration of unsupervised CTRnet. Best viewed in color. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of sparse and time-varying speaker overlap. Each colored [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Illustration of PuLSS. Best viewed in color. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Illustration of block-wise inference. tive blocks (i.e., extracting a 12-second block every 1 second), resulting in 123, 339 blocks (∼411 hours) for model training. For the inference of CTRnet and PuLSS, we apply the trained models block-wise to process each session, and stitch the processing results along time. See [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a workable route to train separation models on real paired close-talk/far-field recordings by first reducing cross-talk then using the outputs as pseudo-labels, but the quality of those labels is not directly checked.

read the letter

The core idea is straightforward and useful. They define cross-talk reduction as a separate task, train CTRnet on actual recorded close-talk and far-field pairs to pull out the wearer's speech, then feed those estimates into PuLSS to supervise far-field separation. This lets both stages stay in the target domain instead of relying on simulated mixtures, which is a practical step for conversational data like CHiME-6.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes CTRnet, a model trained directly on real-recorded close-talk/far-field mixture pairs to perform cross-talk reduction (CTR) on close-talk signals, followed by PuLSS which uses the resulting estimates as pseudo-labels to train far-field speech separation. On CHiME-6, the framework reports state-of-the-art ASR word error rates under both oracle and estimated diarization, surpassing prior CHiME-7/8 submissions and guided source separation on real conversational data.

Significance. If the central claims hold after verification of pseudo-label quality, the work would be significant for demonstrating that real-domain training via auxiliary close-talk microphones can close the simulation-to-real gap in speech separation and recognition. The two-stage design and explicit use of real pairs address a persistent practical limitation; credit is due for focusing on held-out real evaluation rather than simulated data alone.

major comments (3)

[§3.2] §3.2 (PuLSS description): The claim that CTRnet outputs from real close-talk mixtures provide effective pseudo-labels for far-field training is load-bearing for the reported ASR gains, yet no quantitative verification (SI-SDR, PESQ, or oracle ASR delta on held-out real segments) is supplied to show the estimates are sufficiently clean rather than noisy; residual cross-talk or artifacts could explain or undermine the improvements over guided source separation.
[§4] §4 (Experimental results): The SOTA ASR numbers on CHiME-6 under estimated diarization lack ablations isolating the contribution of CTRnet pseudo-labels versus raw close-talk signals or simulated-data baselines, and no statistical significance tests or error bars are reported to confirm the gains are robust rather than dataset-specific.
[§2.2] §2.2 (CTRnet training): The supervision mechanism for training CTRnet on real pairs without clean targets is not fully specified; if the loss relies on far-field mixtures in a way that introduces circular dependence, the pseudo-label step risks reducing to a fitted quantity rather than providing independent supervision.

minor comments (2)

[§3] Notation for the two-stage pipeline (CTRnet then PuLSS) should be introduced with a single diagram or equation block to avoid repeated re-definition across sections.
[Table 1] Table 1 (baseline comparisons) would benefit from explicit column for training data type (real vs. simulated) to highlight the domain-gap advantage claimed in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we will revise the manuscript to strengthen the presentation and where we provide additional clarification.

read point-by-point responses

Referee: [§3.2] §3.2 (PuLSS description): The claim that CTRnet outputs from real close-talk mixtures provide effective pseudo-labels for far-field training is load-bearing for the reported ASR gains, yet no quantitative verification (SI-SDR, PESQ, or oracle ASR delta on held-out real segments) is supplied to show the estimates are sufficiently clean rather than noisy; residual cross-talk or artifacts could explain or undermine the improvements over guided source separation.

Authors: We agree that explicit quantitative verification of pseudo-label quality would strengthen the manuscript. In the revision we will add SI-SDR and PESQ results on held-out real close-talk segments (where reference signals permit) together with an oracle-ASR delta obtained by feeding CTRnet outputs directly into the recognizer. These metrics will be reported alongside the existing end-to-end ASR results on CHiME-6 to demonstrate that the pseudo-labels are sufficiently clean to drive the observed gains over guided source separation. revision: yes
Referee: [§4] §4 (Experimental results): The SOTA ASR numbers on CHiME-6 under estimated diarization lack ablations isolating the contribution of CTRnet pseudo-labels versus raw close-talk signals or simulated-data baselines, and no statistical significance tests or error bars are reported to confirm the gains are robust rather than dataset-specific.

Authors: We will incorporate the requested ablations in the revised manuscript: (i) PuLSS trained on raw close-talk signals without CTRnet, (ii) PuLSS trained exclusively on simulated data, and (iii) the full CTRnet + PuLSS pipeline. We will also report error bars obtained from multiple independent training runs and include paired statistical significance tests on the WER differences to establish that the improvements are robust. revision: yes
Referee: [§2.2] §2.2 (CTRnet training): The supervision mechanism for training CTRnet on real pairs without clean targets is not fully specified; if the loss relies on far-field mixtures in a way that introduces circular dependence, the pseudo-label step risks reducing to a fitted quantity rather than providing independent supervision.

Authors: Section 2.2 specifies that CTRnet is trained by minimizing a composite loss comprising a reconstruction term on the close-talk output and a cross-domain consistency term that aligns the estimated wearer speech with the corresponding far-field mixture after accounting for acoustic differences. The far-field signal is used only as an auxiliary reference for the shared speech content and is never employed as a direct target for the close-talk output; therefore the supervision remains independent. We will expand the loss formulation with explicit equations in the revision to remove any remaining ambiguity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; training and pseudo-label steps remain independent of final metrics

full rationale

The derivation chain begins with CTRnet trained directly on real-recorded close-talk/far-field pairs to produce cross-talk-reduced estimates, which are then used as pseudo-labels to train PuLSS for far-field separation. No quoted equations, self-citations, or fitted parameters in the abstract or described framework reduce the reported CHiME-6 ASR gains by construction to the input pairs or to a renamed version of the same supervision signal. The pseudo-label quality assumption is an empirical claim subject to verification on held-out data rather than a definitional loop, and the SOTA result is presented as an observed outcome rather than a statistical necessity from the training setup itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on standard deep-learning assumptions for audio processing plus two new model constructs; no new physical constants or unproven mathematical lemmas are introduced beyond typical neural-network training.

axioms (2)

domain assumption Close-talk mixtures contain recoverable information about the wearer's speech that a neural network can isolate from cross-talk.
Invoked when stating that CTRnet can be trained directly on real close-talk/far-field pairs.
domain assumption Pseudo-labels generated by CTRnet are sufficiently accurate to supervise far-field separation training.
This premise is required for the PuLSS stage to improve rather than degrade performance.

invented entities (2)

CTRnet no independent evidence
purpose: Neural network for cross-talk reduction on close-talk mixtures
New model introduced to accomplish the CTR task.
PuLSS no independent evidence
purpose: Pseudo-label based far-field speech separation framework
Overall training pipeline that uses CTRnet outputs.

pith-pipeline@v0.9.0 · 5820 in / 1627 out tokens · 87416 ms · 2026-05-20T01:50:03.816850+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

[1]

Comon and C

P. Comon and C. Jutten,Handbook of Blind Source Separation: Inde- pendent component analysis and applications. Academic press, 2010

work page 2010
[2]

The Cocktail Party Problem,

J. H. McDermott, “The Cocktail Party Problem,”Current Biology, vol. 19, no. 22, pp. 1024–1027, 2009

work page 2009
[3]

Supervised Speech Separation Based on Deep Learning: An Overview,

D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,”IEEE/ACM Trans. Audio, Speech, Lang. Pro- cess., vol. 26, no. 10, pp. 1702–1726, 2018

work page 2018
[4]

30+ Years of Source Separation Research: Achievements and Future Challenges,

S. Araki, N. Ito, R. Haeb-Umbach, G. Wichern, Z.-Q. Wang, and Y . Mitsufuji, “30+ Years of Source Separation Research: Achievements and Future Challenges,” inProc. ICASSP, 2025

work page 2025
[5]

Far-Field Automatic Speech Recognition,

R. Haeb-Umbach, J. Heymann, L. Drude, S. Watanabe, M. Delcroix, and T. Nakatani, “Far-Field Automatic Speech Recognition,”Proc. IEEE, vol. 109, no. 2, pp. 124–148, 2021

work page 2021
[6]

Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,

R. Haeb-Umbach, T. Nakatani, M. Delcroix, C. Boeddeker, and T. Ochiai, “Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,”IEEE Signal Process. Mag., vol. 41, no. 6, pp. 12–23, 2025

work page 2025
[7]

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 3221–3236, 2023

work page 2023
[8]

Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,

W. Zhang, J. Shi, C. Li, S. Watanabe, and Y . Qian, “Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,” inProc. WASPAA, 2021, pp. 146–150

work page 2021
[9]

Real-M: Towards Speech Separation on Real Mixtures,

C. Subakan, M. Ravanelli, S. Cornell, and F. Grondin, “Real-M: Towards Speech Separation on Real Mixtures,” inProc. ICASSP, 2022, pp. 6862– 6866

work page 2022
[10]

Summary of the NOTSOFAR-1 challenge: Highlights and learnings,

I. Abramovski, A. Vinnikov, S. Shaeret al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,”Comput. Speech Lang., vol. 93, p. 101796, 2025

work page 2025
[11]

Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,

S. Cornell, C. Boeddeker, T. Park, H. Huang, D. Raj, M. Wiesner, Y . Masuyama, X. Chang, Z.-Q. Wang, S. Squartini, P. Garcia, and S. Watanabe, “Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,”Comput. Speech Lang., vol. 97, 2026

work page 2026
[12]

An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,

Y . Masuyama, X. Chang, W. Zhang, S. Cornell, Z.-Q. Wang, N. Ono, Y . Qian, and S. Watanabe, “An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,” Comput. Speech Lang., vol. 95, p. 101813, 2026

work page 2026
[13]

The AMI Meeting Corpus: A Pre-Announcement,

J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemotet al., “The AMI Meeting Corpus: A Pre-Announcement,” inMachine Learning for Multimodal Interaction, 2006, pp. 28–39. 16

work page 2006
[14]

Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,

F. Yu, S. Zhang, P. Guo, Y . Fu, Z. Duet al., “Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,” inProc. ICASSP, 2022, pp. 9156–9160

work page 2022
[15]

The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,

J. Barker, S. Watanabe, E. Vincentet al., “The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,” in Proc. Interspeech, 2018, pp. 1561–1565

work page 2018
[16]

The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,

Z. Wang, S. Wu, H. Chenet al., “The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,” inProc. ICASSP, 2023, pp. 1–5

work page 2022
[17]

Cross-Talk Reduction,

Z.-Q. Wang, A. Kumar, and S. Watanabe, “Cross-Talk Reduction,” in Proc. IJCAI, 2024, pp. 5171–5180

work page 2024
[18]

BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,

J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,” inProc. ASRU, 2015, pp. 444–451

work page 2015
[19]

Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,

H. Erdogan, J. R. Hershey, S. Watanabe, I. Mandel, and J. Le Roux, “Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,” inProc. Interspeech, 2016, pp. 1981–1985

work page 2016
[20]

How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,

K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, “How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,”Proc. Interspeech, pp. 5418–5422, 2022

work page 2022
[21]

VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,

N. Kanda, J. Wu, X. Wang, Z. Chen, J. Li, and T. Yoshioka, “VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,” inProc. ICASSP, 2023, pp. 1–5

work page 2023
[22]

Unsupervised Sound Separation using Mixture Invariant Training,

S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. R. Hershey, “Unsupervised Sound Separation using Mixture Invariant Training,”Proc. NeurIPS, vol. 33, pp. 3846–3857, 2020

work page 2020
[23]

Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,

J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,” in Proc. Interspeech, 2021

work page 2021
[24]

Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,

K. Saijo and T. Ogawa, “Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,” inProc. ICASSP, 2023, pp. 1–5

work page 2023
[25]

Unsupervised Multi- Channel Separation and Adaptation,

C. Han, K. Wilson, S. Wisdom, and J. R. Hershey, “Unsupervised Multi- Channel Separation and Adaptation,” inProc. ICASSP, 2024, pp. 721– 725

work page 2024
[26]

Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,

A. Sivaraman, S. Wisdom, H. Erdogan, and J. R. Hershey, “Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,” inProc. ICASSP, 2022, pp. 686–690

work page 2022
[27]

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,

S. Wisdom, A. Jansen, R. J. Weiss, H. Erdogan, and J. R. Hershey, “Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,” inProc. WASPAA, 2021, pp. 51–55

work page 2021
[28]

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,

Z.-Q. Wang and S. Watanabe, “UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,” inProc. NeurIPS, vol. 36, 2023, pp. 34 021–34 042

work page 2023
[29]

Enhanced Reverberation as Supervision for Unsupervised Speech Separation,

K. Saijo, G. Wichern, F. G. Germain, Z. Pan, and J. Le Roux, “Enhanced Reverberation as Supervision for Unsupervised Speech Separation,” in Proc. Interspeech, 2024, pp. 607–611

work page 2024
[30]

Spatial Loss for Unsupervised Multi-channel Source Separation,

K. Saijo and R. Scheibler, “Spatial Loss for Unsupervised Multi-channel Source Separation,” inProc. Interspeech, 2022, pp. 241–245

work page 2022
[31]

Front-End Processing for The CHiME-5 Dinner Party Scenario,

C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Hey- mann, and R. Haeb-Umbach, “Front-End Processing for The CHiME-5 Dinner Party Scenario,” inProc. CHiME, vol. 1, 2018

work page 2018
[32]

VarArray: Array-Geometry-Agnostic Continuous Speech Separation,

T. Yoshioka, X. Wang, D. Wang, M. Tang, Z. Zhu, Z. Chen, and N. Kanda, “VarArray: Array-Geometry-Agnostic Continuous Speech Separation,” inProc. ICASSP, 2022, pp. 6027–6031

work page 2022
[33]

Continuous speech separation: Dataset and analysis,

Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y . Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” inProc. ICASSP, 2020, pp. 7284–7288

work page 2020
[34]

NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,

A. Vinnikov, A. Ivry, A. Hurvitzet al., “NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,” in Proc. Interspeech, 2024

work page 2024
[35]

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,

T. V on Neumann, C. Boeddeker, T. Cord-Landwehr, M. Delcroix, and R. Haeb-Umbach, “Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,” inProc. HSCMA, 2024, pp. 775–779

work page 2024
[36]

Neural fast full-rank spatial covariance analysis for blind source separation,

Y . Bando, Y . Masuyama, A. A. Nugraha, and K. Yoshii, “Neural fast full-rank spatial covariance analysis for blind source separation,” inProc. EUSIPCO, 2023, pp. 51–55

work page 2023
[37]

Neural Blind Source Separation and Diarization for Distant Speech Recognition,

Y . Bando, T. Nakamura, and S. Watanabe, “Neural Blind Source Separation and Diarization for Distant Speech Recognition,” inProc. Interspeech, 2024, pp. 722–726

work page 2024
[38]

Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,

Y . Bando, S. Cornell, S. Fukayama, and S. Watanabe, “Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,” inProc. ICASSP, 2025, pp. 1–5

work page 2025
[39]

ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,

Z.-Q. Wang, “ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,”J. Acoust. Soc. Am., vol. 158, no. 4, pp. 2849– 2862, 2025

work page 2025
[40]

SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,

——, “SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,”Neural Networks, vol. 188, no. 107408, pp. 1–16, 2025

work page 2025
[41]

Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,

——, “Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,”IEEE Signal Process. Lett., vol. 31, pp. 1715–1719, 2024

work page 2024
[42]

SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,

L. Luo, L. Li, and Q. Hong, “SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,” inProc. Interspeech, 2025, pp. 3404–3408

work page 2025
[43]

Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,

L. Luo, S. Lu, L. Li, and Q. Hong, “Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,” in Proc. Interspeech, 2025, pp. 1883–1887

work page 2025
[44]

Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,

R. Talmon, I. Cohen, and S. Gannot, “Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,”IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 546–555, 2009

work page 2009
[45]

A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,

S. Gannot, E. Vincentet al., “A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 692–730, 2017

work page 2017
[46]

Understanding Blind Deconvolution Algorithms,

A. Levin, Y . Weiss, F. Durand, and W. T. Freeman, “Understanding Blind Deconvolution Algorithms,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2354–2367, 2011

work page 2011
[47]

Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,

Z.-Q. Wang, G. Wichern, and J. Le Roux, “Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3476–3490, 2021

work page 2021
[48]

Differentiable Consistency Constraints for Improved Deep Speech Enhancement,

S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable Consistency Constraints for Improved Deep Speech Enhancement,” inProc. ICASSP, 2019, pp. 900–904

work page 2019
[49]

Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,

M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017

work page 1901
[50]

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,

D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y . Luoet al., “Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,” inProc. SLT, 2021, pp. 897–904

work page 2021
[51]

CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,

S. Watanabe, M. Mandel, J. Barkeret al., “CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,” in Proc. CHiME, 2020, pp. 1–7

work page 2020
[52]

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,

S. Cornell, M. S. Wiesner, S. Watanabe, D. Raj, X. Chang, P. Garcia, Y . Masuyam, Z.-Q. Wang, S. Squartini, and S. Khudanpur, “The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,” inProc. CHiME, 2023, pp. 1–6

work page 2023
[53]

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,

S. Cornell, T. J. Park, H. Huang, C. Boeddeker, X. Chang, M. Maciejew- ski, M. S. Wiesner, P. Garcia, and S. Watanabe, “The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,” inProc. CHiME, 2024, pp. 1– 6

work page 2024
[54]

The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,

R. Wang, M. He, J. Duet al., “The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,” inProc. CHiME, 2023, pp. 13–18

work page 2023
[55]

The IACAS- Thinkit System for CHiME-7 Challenge,

L. Ye, H. Lu, G. Cheng, Y . Chen, Z. Shang, and X. Li, “The IACAS- Thinkit System for CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 23–26

work page 2023
[56]

Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,

Z.-Q. Wang, P. Wang, and D. Wang, “Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2001– 2014, 2021

work page 2001
[57]

Librispeech: An ASR Corpus Based on Public Domain Audio Books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,”Proc. ICASSP, pp. 5206–5210, 2015

work page 2015
[58]

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,

J. Richter, Y . C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” inProc. Interspeech, 2024, pp. 4873–4877

work page 2024
[59]

FSD50K: An Open Dataset of Human-Labeled Sound Events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022

work page 2022
[60]

A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,

K. Kinoshita, M. Delcroix, S. Gannotet al., “A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,”Eurasip J. Adv. Signal Process., vol. 2016, no. 1, pp. 1–19, 2016

work page 2016
[61]

Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,

T. von Neumann, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,”IEEE Trans. Audio, Speech, Lang. Process., 2025. 17

work page 2025
[62]

arXiv preprint arXiv:2509.14128 , year =

M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1B-v2 & Parakeet-TDT- 0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST,”arXiv preprint arXiv:2509.14128, 2025

work page arXiv 2025
[63]

arXiv preprint arXiv:1909.09577 , year=

O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V . Lavrukhin, J. Cooket al., “NeMo: A Toolkit for Building AI Applications using Neural Modules,”arXiv preprint arXiv:1909.09577, 2019

work page arXiv 1909
[64]

BUT CHiME-7 System Description,

M. Karafiat, K. Vesel ´y, I. Szoke, L. Mosner, K. Benes, M. Witkowski, R. G. Barchi, and L. D. Pepino, “BUT CHiME-7 System Description,” inProc. CHiME, 2023, pp. 67–72

work page 2023
[65]

The NPU System for DASR Task of CHiME-7 Challenge,

B. Mu, P. Guo, H. Wang, Y . Li, Y . Li, P. Zhou, W. Chen, and L. Xie, “The NPU System for DASR Task of CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 63–66

work page 2023
[66]

The University of Cambridge System for the CHiME-7 DASR Task,

K. Deng, X. Zheng, and P. Woodland, “The University of Cambridge System for the CHiME-7 DASR Task,” inProc. CHiME, 2023, pp. 73– 76

work page 2023
[67]

NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,

N. Kamo, N. Tawara, A. Andoet al., “NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 69–74

work page 2024
[68]

STCON System for the CHiME-8 Challenge,

A. Mitrofanov, T. Prisyach, T. Timofeevaet al., “STCON System for the CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 13–17

work page 2024
[69]

The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,

N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,” inProc. Interspeech, 2019, pp. 978–982

work page 2019
[70]

pyannote.audio: Neural Building Blocks for Speaker Diarization,

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeuxet al., “pyannote.audio: Neural Building Blocks for Speaker Diarization,” inProc. ICASSP, 2020, pp. 7124–7128

work page 2020
[71]

Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,

K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,” inProc. ICASSP, 2021, pp. 7198–7202

work page 2021
[72]

Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,

Y . Lee, S. Choi, B. Y . Kim, Z. Q. Wang, and S. Watanabe, “Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,” inProc. ICASSP, 2024, pp. 446–450. Zhong-Qiu Wangreceived the B.E. degree in com- puter science and technology from Harbin Institute of Technology, Harbin, China, in2013, and the Ph.D. degree in computer science...

work page 2024

[1] [1]

Comon and C

P. Comon and C. Jutten,Handbook of Blind Source Separation: Inde- pendent component analysis and applications. Academic press, 2010

work page 2010

[2] [2]

The Cocktail Party Problem,

J. H. McDermott, “The Cocktail Party Problem,”Current Biology, vol. 19, no. 22, pp. 1024–1027, 2009

work page 2009

[3] [3]

Supervised Speech Separation Based on Deep Learning: An Overview,

D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,”IEEE/ACM Trans. Audio, Speech, Lang. Pro- cess., vol. 26, no. 10, pp. 1702–1726, 2018

work page 2018

[4] [4]

30+ Years of Source Separation Research: Achievements and Future Challenges,

S. Araki, N. Ito, R. Haeb-Umbach, G. Wichern, Z.-Q. Wang, and Y . Mitsufuji, “30+ Years of Source Separation Research: Achievements and Future Challenges,” inProc. ICASSP, 2025

work page 2025

[5] [5]

Far-Field Automatic Speech Recognition,

R. Haeb-Umbach, J. Heymann, L. Drude, S. Watanabe, M. Delcroix, and T. Nakatani, “Far-Field Automatic Speech Recognition,”Proc. IEEE, vol. 109, no. 2, pp. 124–148, 2021

work page 2021

[6] [6]

Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,

R. Haeb-Umbach, T. Nakatani, M. Delcroix, C. Boeddeker, and T. Ochiai, “Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,”IEEE Signal Process. Mag., vol. 41, no. 6, pp. 12–23, 2025

work page 2025

[7] [7]

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 3221–3236, 2023

work page 2023

[8] [8]

Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,

W. Zhang, J. Shi, C. Li, S. Watanabe, and Y . Qian, “Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,” inProc. WASPAA, 2021, pp. 146–150

work page 2021

[9] [9]

Real-M: Towards Speech Separation on Real Mixtures,

C. Subakan, M. Ravanelli, S. Cornell, and F. Grondin, “Real-M: Towards Speech Separation on Real Mixtures,” inProc. ICASSP, 2022, pp. 6862– 6866

work page 2022

[10] [10]

Summary of the NOTSOFAR-1 challenge: Highlights and learnings,

I. Abramovski, A. Vinnikov, S. Shaeret al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,”Comput. Speech Lang., vol. 93, p. 101796, 2025

work page 2025

[11] [11]

Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,

S. Cornell, C. Boeddeker, T. Park, H. Huang, D. Raj, M. Wiesner, Y . Masuyama, X. Chang, Z.-Q. Wang, S. Squartini, P. Garcia, and S. Watanabe, “Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,”Comput. Speech Lang., vol. 97, 2026

work page 2026

[12] [12]

An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,

Y . Masuyama, X. Chang, W. Zhang, S. Cornell, Z.-Q. Wang, N. Ono, Y . Qian, and S. Watanabe, “An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,” Comput. Speech Lang., vol. 95, p. 101813, 2026

work page 2026

[13] [13]

The AMI Meeting Corpus: A Pre-Announcement,

J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemotet al., “The AMI Meeting Corpus: A Pre-Announcement,” inMachine Learning for Multimodal Interaction, 2006, pp. 28–39. 16

work page 2006

[14] [14]

Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,

F. Yu, S. Zhang, P. Guo, Y . Fu, Z. Duet al., “Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,” inProc. ICASSP, 2022, pp. 9156–9160

work page 2022

[15] [15]

The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,

J. Barker, S. Watanabe, E. Vincentet al., “The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,” in Proc. Interspeech, 2018, pp. 1561–1565

work page 2018

[16] [16]

The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,

Z. Wang, S. Wu, H. Chenet al., “The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,” inProc. ICASSP, 2023, pp. 1–5

work page 2022

[17] [17]

Cross-Talk Reduction,

Z.-Q. Wang, A. Kumar, and S. Watanabe, “Cross-Talk Reduction,” in Proc. IJCAI, 2024, pp. 5171–5180

work page 2024

[18] [18]

BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,

J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,” inProc. ASRU, 2015, pp. 444–451

work page 2015

[19] [19]

Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,

H. Erdogan, J. R. Hershey, S. Watanabe, I. Mandel, and J. Le Roux, “Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,” inProc. Interspeech, 2016, pp. 1981–1985

work page 2016

[20] [20]

How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,

K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, “How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,”Proc. Interspeech, pp. 5418–5422, 2022

work page 2022

[21] [21]

VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,

N. Kanda, J. Wu, X. Wang, Z. Chen, J. Li, and T. Yoshioka, “VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,” inProc. ICASSP, 2023, pp. 1–5

work page 2023

[22] [22]

Unsupervised Sound Separation using Mixture Invariant Training,

S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. R. Hershey, “Unsupervised Sound Separation using Mixture Invariant Training,”Proc. NeurIPS, vol. 33, pp. 3846–3857, 2020

work page 2020

[23] [23]

Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,

J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,” in Proc. Interspeech, 2021

work page 2021

[24] [24]

Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,

K. Saijo and T. Ogawa, “Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,” inProc. ICASSP, 2023, pp. 1–5

work page 2023

[25] [25]

Unsupervised Multi- Channel Separation and Adaptation,

C. Han, K. Wilson, S. Wisdom, and J. R. Hershey, “Unsupervised Multi- Channel Separation and Adaptation,” inProc. ICASSP, 2024, pp. 721– 725

work page 2024

[26] [26]

Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,

A. Sivaraman, S. Wisdom, H. Erdogan, and J. R. Hershey, “Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,” inProc. ICASSP, 2022, pp. 686–690

work page 2022

[27] [27]

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,

S. Wisdom, A. Jansen, R. J. Weiss, H. Erdogan, and J. R. Hershey, “Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,” inProc. WASPAA, 2021, pp. 51–55

work page 2021

[28] [28]

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,

Z.-Q. Wang and S. Watanabe, “UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,” inProc. NeurIPS, vol. 36, 2023, pp. 34 021–34 042

work page 2023

[29] [29]

Enhanced Reverberation as Supervision for Unsupervised Speech Separation,

K. Saijo, G. Wichern, F. G. Germain, Z. Pan, and J. Le Roux, “Enhanced Reverberation as Supervision for Unsupervised Speech Separation,” in Proc. Interspeech, 2024, pp. 607–611

work page 2024

[30] [30]

Spatial Loss for Unsupervised Multi-channel Source Separation,

K. Saijo and R. Scheibler, “Spatial Loss for Unsupervised Multi-channel Source Separation,” inProc. Interspeech, 2022, pp. 241–245

work page 2022

[31] [31]

Front-End Processing for The CHiME-5 Dinner Party Scenario,

C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Hey- mann, and R. Haeb-Umbach, “Front-End Processing for The CHiME-5 Dinner Party Scenario,” inProc. CHiME, vol. 1, 2018

work page 2018

[32] [32]

VarArray: Array-Geometry-Agnostic Continuous Speech Separation,

T. Yoshioka, X. Wang, D. Wang, M. Tang, Z. Zhu, Z. Chen, and N. Kanda, “VarArray: Array-Geometry-Agnostic Continuous Speech Separation,” inProc. ICASSP, 2022, pp. 6027–6031

work page 2022

[33] [33]

Continuous speech separation: Dataset and analysis,

Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y . Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” inProc. ICASSP, 2020, pp. 7284–7288

work page 2020

[34] [34]

NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,

A. Vinnikov, A. Ivry, A. Hurvitzet al., “NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,” in Proc. Interspeech, 2024

work page 2024

[35] [35]

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,

T. V on Neumann, C. Boeddeker, T. Cord-Landwehr, M. Delcroix, and R. Haeb-Umbach, “Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,” inProc. HSCMA, 2024, pp. 775–779

work page 2024

[36] [36]

Neural fast full-rank spatial covariance analysis for blind source separation,

Y . Bando, Y . Masuyama, A. A. Nugraha, and K. Yoshii, “Neural fast full-rank spatial covariance analysis for blind source separation,” inProc. EUSIPCO, 2023, pp. 51–55

work page 2023

[37] [37]

Neural Blind Source Separation and Diarization for Distant Speech Recognition,

Y . Bando, T. Nakamura, and S. Watanabe, “Neural Blind Source Separation and Diarization for Distant Speech Recognition,” inProc. Interspeech, 2024, pp. 722–726

work page 2024

[38] [38]

Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,

Y . Bando, S. Cornell, S. Fukayama, and S. Watanabe, “Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,” inProc. ICASSP, 2025, pp. 1–5

work page 2025

[39] [39]

ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,

Z.-Q. Wang, “ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,”J. Acoust. Soc. Am., vol. 158, no. 4, pp. 2849– 2862, 2025

work page 2025

[40] [40]

SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,

——, “SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,”Neural Networks, vol. 188, no. 107408, pp. 1–16, 2025

work page 2025

[41] [41]

Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,

——, “Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,”IEEE Signal Process. Lett., vol. 31, pp. 1715–1719, 2024

work page 2024

[42] [42]

SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,

L. Luo, L. Li, and Q. Hong, “SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,” inProc. Interspeech, 2025, pp. 3404–3408

work page 2025

[43] [43]

Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,

L. Luo, S. Lu, L. Li, and Q. Hong, “Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,” in Proc. Interspeech, 2025, pp. 1883–1887

work page 2025

[44] [44]

Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,

R. Talmon, I. Cohen, and S. Gannot, “Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,”IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 546–555, 2009

work page 2009

[45] [45]

A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,

S. Gannot, E. Vincentet al., “A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 692–730, 2017

work page 2017

[46] [46]

Understanding Blind Deconvolution Algorithms,

A. Levin, Y . Weiss, F. Durand, and W. T. Freeman, “Understanding Blind Deconvolution Algorithms,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2354–2367, 2011

work page 2011

[47] [47]

Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,

Z.-Q. Wang, G. Wichern, and J. Le Roux, “Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3476–3490, 2021

work page 2021

[48] [48]

Differentiable Consistency Constraints for Improved Deep Speech Enhancement,

S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable Consistency Constraints for Improved Deep Speech Enhancement,” inProc. ICASSP, 2019, pp. 900–904

work page 2019

[49] [49]

Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,

M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017

work page 1901

[50] [50]

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,

D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y . Luoet al., “Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,” inProc. SLT, 2021, pp. 897–904

work page 2021

[51] [51]

CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,

S. Watanabe, M. Mandel, J. Barkeret al., “CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,” in Proc. CHiME, 2020, pp. 1–7

work page 2020

[52] [52]

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,

S. Cornell, M. S. Wiesner, S. Watanabe, D. Raj, X. Chang, P. Garcia, Y . Masuyam, Z.-Q. Wang, S. Squartini, and S. Khudanpur, “The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,” inProc. CHiME, 2023, pp. 1–6

work page 2023

[53] [53]

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,

S. Cornell, T. J. Park, H. Huang, C. Boeddeker, X. Chang, M. Maciejew- ski, M. S. Wiesner, P. Garcia, and S. Watanabe, “The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,” inProc. CHiME, 2024, pp. 1– 6

work page 2024

[54] [54]

The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,

R. Wang, M. He, J. Duet al., “The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,” inProc. CHiME, 2023, pp. 13–18

work page 2023

[55] [55]

The IACAS- Thinkit System for CHiME-7 Challenge,

L. Ye, H. Lu, G. Cheng, Y . Chen, Z. Shang, and X. Li, “The IACAS- Thinkit System for CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 23–26

work page 2023

[56] [56]

Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,

Z.-Q. Wang, P. Wang, and D. Wang, “Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2001– 2014, 2021

work page 2001

[57] [57]

Librispeech: An ASR Corpus Based on Public Domain Audio Books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,”Proc. ICASSP, pp. 5206–5210, 2015

work page 2015

[58] [58]

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,

J. Richter, Y . C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” inProc. Interspeech, 2024, pp. 4873–4877

work page 2024

[59] [59]

FSD50K: An Open Dataset of Human-Labeled Sound Events,

E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022

work page 2022

[60] [60]

A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,

K. Kinoshita, M. Delcroix, S. Gannotet al., “A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,”Eurasip J. Adv. Signal Process., vol. 2016, no. 1, pp. 1–19, 2016

work page 2016

[61] [61]

Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,

T. von Neumann, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,”IEEE Trans. Audio, Speech, Lang. Process., 2025. 17

work page 2025

[62] [62]

arXiv preprint arXiv:2509.14128 , year =

M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1B-v2 & Parakeet-TDT- 0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST,”arXiv preprint arXiv:2509.14128, 2025

work page arXiv 2025

[63] [63]

arXiv preprint arXiv:1909.09577 , year=

O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V . Lavrukhin, J. Cooket al., “NeMo: A Toolkit for Building AI Applications using Neural Modules,”arXiv preprint arXiv:1909.09577, 2019

work page arXiv 1909

[64] [64]

BUT CHiME-7 System Description,

M. Karafiat, K. Vesel ´y, I. Szoke, L. Mosner, K. Benes, M. Witkowski, R. G. Barchi, and L. D. Pepino, “BUT CHiME-7 System Description,” inProc. CHiME, 2023, pp. 67–72

work page 2023

[65] [65]

The NPU System for DASR Task of CHiME-7 Challenge,

B. Mu, P. Guo, H. Wang, Y . Li, Y . Li, P. Zhou, W. Chen, and L. Xie, “The NPU System for DASR Task of CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 63–66

work page 2023

[66] [66]

The University of Cambridge System for the CHiME-7 DASR Task,

K. Deng, X. Zheng, and P. Woodland, “The University of Cambridge System for the CHiME-7 DASR Task,” inProc. CHiME, 2023, pp. 73– 76

work page 2023

[67] [67]

NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,

N. Kamo, N. Tawara, A. Andoet al., “NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 69–74

work page 2024

[68] [68]

STCON System for the CHiME-8 Challenge,

A. Mitrofanov, T. Prisyach, T. Timofeevaet al., “STCON System for the CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 13–17

work page 2024

[69] [69]

The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,

N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,” inProc. Interspeech, 2019, pp. 978–982

work page 2019

[70] [70]

pyannote.audio: Neural Building Blocks for Speaker Diarization,

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeuxet al., “pyannote.audio: Neural Building Blocks for Speaker Diarization,” inProc. ICASSP, 2020, pp. 7124–7128

work page 2020

[71] [71]

Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,

K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,” inProc. ICASSP, 2021, pp. 7198–7202

work page 2021

[72] [72]

Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,

Y . Lee, S. Choi, B. Y . Kim, Z. Q. Wang, and S. Watanabe, “Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,” inProc. ICASSP, 2024, pp. 446–450. Zhong-Qiu Wangreceived the B.E. degree in com- puter science and technology from Harbin Institute of Technology, Harbin, China, in2013, and the Ph.D. degree in computer science...

work page 2024