pith. sign in

arxiv: 2605.19695 · v1 · pith:BWI6AXPAnew · submitted 2026-05-19 · 📡 eess.AS · cs.SD

Cross-Talk Speech Reduction, by Separation, for Separation

Pith reviewed 2026-05-20 01:50 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords speech separationcross-talk reductionpseudo-labelsfar-field audioCHiME-6conversational ASRreal data trainingneural networks
0
0 comments X

The pith

Cross-talk reduction on real close-talk recordings produces pseudo-labels that train far-field separation models to new state-of-the-art ASR levels on CHiME-6.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that cross-talk contamination in close-talk microphone signals can be removed by a neural model trained directly on real pairs of close-talk and far-field mixtures. These cleaned signals then act as pseudo-labels for training separation models that operate on the far-field recordings. A sympathetic reader would care because conventional training relies on simulated data that fails to match real room acoustics, speaker movement, and noise. By using only target-domain recordings the method closes that gap and improves downstream automatic speech recognition. The result is the first neural separation system shown to substantially beat guided source separation on genuine conversational speech-in-the-wild.

Core claim

The central claim is that a network called CTRnet, trained end-to-end on real-recorded close-talk and far-field mixture pairs, isolates each speaker's voice from cross-talk interference; the resulting estimates serve as effective pseudo-labels for a second stage, pseudo-label based far-field speech separation, that achieves state-of-the-art ASR word error rates on the CHiME-6 dataset under both oracle and estimated diarization while surpassing all prior CHiME-7 and CHiME-8 submissions.

What carries the argument

CTRnet, a neural separation model trained on real close-talk/far-field pairs to isolate the wearer's speech from cross-talk, whose outputs supply the pseudo-labels for the PuLSS far-field training stage.

If this is right

  • Both CTRnet and the downstream far-field model can be trained entirely on real target-domain recordings without simulation.
  • The framework delivers state-of-the-art ASR under oracle and estimated speaker diarization on CHiME-6.
  • It is the first neural separation approach shown to substantially outperform guided source separation on real conversational data.
  • Close-talk mixtures, previously too noisy for direct use, become usable weak supervision after cross-talk reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same real-pair training pattern could be applied to other microphone arrays where partial close-talk signals exist.
  • Integrating CTRnet-style reduction inside an end-to-end diarization-plus-separation pipeline might further reduce error propagation.
  • The pseudo-label strategy suggests a general route for adapting separation models to new acoustic environments using only the deployment hardware.

Load-bearing premise

The speech estimates produced by CTRnet on real close-talk mixtures must be clean enough to improve far-field model training rather than inject harmful label noise.

What would settle it

If ASR word error rate on the CHiME-6 evaluation set rises or stays flat when far-field models are retrained with CTRnet pseudo-labels instead of guided source separation labels, the central claim is false.

Figures

Figures reproduced from arXiv: 2605.19695 by Samuele Cornell, Zhong-Qiu Wang.

Figure 1
Figure 1. Figure 1: Typical setup for collecting training data in conversational scenarios, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: System overview. (a) Training Stage: CTRnet is trained in a semi￾supervised manner on real-recorded pairs of close-talk and far-field mixtures to estimate close-talk speech (see Section IV-D). The estimate is then used as pseudo-labels for training PuLSS in a supervised fashion on real-recorded far￾field mixtures (see Section V-D). In PuLSS, oracle speaker-activity timestamps are used in input features to … view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of unsupervised CTRnet. Best viewed in color. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of sparse and time-varying speaker overlap. Each colored [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of PuLSS. Best viewed in color. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of block-wise inference. tive blocks (i.e., extracting a 12-second block every 1 second), resulting in 123, 339 blocks (∼411 hours) for model training. For the inference of CTRnet and PuLSS, we apply the trained models block-wise to process each session, and stitch the processing results along time. See [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes CTRnet, a model trained directly on real-recorded close-talk/far-field mixture pairs to perform cross-talk reduction (CTR) on close-talk signals, followed by PuLSS which uses the resulting estimates as pseudo-labels to train far-field speech separation. On CHiME-6, the framework reports state-of-the-art ASR word error rates under both oracle and estimated diarization, surpassing prior CHiME-7/8 submissions and guided source separation on real conversational data.

Significance. If the central claims hold after verification of pseudo-label quality, the work would be significant for demonstrating that real-domain training via auxiliary close-talk microphones can close the simulation-to-real gap in speech separation and recognition. The two-stage design and explicit use of real pairs address a persistent practical limitation; credit is due for focusing on held-out real evaluation rather than simulated data alone.

major comments (3)
  1. [§3.2] §3.2 (PuLSS description): The claim that CTRnet outputs from real close-talk mixtures provide effective pseudo-labels for far-field training is load-bearing for the reported ASR gains, yet no quantitative verification (SI-SDR, PESQ, or oracle ASR delta on held-out real segments) is supplied to show the estimates are sufficiently clean rather than noisy; residual cross-talk or artifacts could explain or undermine the improvements over guided source separation.
  2. [§4] §4 (Experimental results): The SOTA ASR numbers on CHiME-6 under estimated diarization lack ablations isolating the contribution of CTRnet pseudo-labels versus raw close-talk signals or simulated-data baselines, and no statistical significance tests or error bars are reported to confirm the gains are robust rather than dataset-specific.
  3. [§2.2] §2.2 (CTRnet training): The supervision mechanism for training CTRnet on real pairs without clean targets is not fully specified; if the loss relies on far-field mixtures in a way that introduces circular dependence, the pseudo-label step risks reducing to a fitted quantity rather than providing independent supervision.
minor comments (2)
  1. [§3] Notation for the two-stage pipeline (CTRnet then PuLSS) should be introduced with a single diagram or equation block to avoid repeated re-definition across sections.
  2. [Table 1] Table 1 (baseline comparisons) would benefit from explicit column for training data type (real vs. simulated) to highlight the domain-gap advantage claimed in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we will revise the manuscript to strengthen the presentation and where we provide additional clarification.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (PuLSS description): The claim that CTRnet outputs from real close-talk mixtures provide effective pseudo-labels for far-field training is load-bearing for the reported ASR gains, yet no quantitative verification (SI-SDR, PESQ, or oracle ASR delta on held-out real segments) is supplied to show the estimates are sufficiently clean rather than noisy; residual cross-talk or artifacts could explain or undermine the improvements over guided source separation.

    Authors: We agree that explicit quantitative verification of pseudo-label quality would strengthen the manuscript. In the revision we will add SI-SDR and PESQ results on held-out real close-talk segments (where reference signals permit) together with an oracle-ASR delta obtained by feeding CTRnet outputs directly into the recognizer. These metrics will be reported alongside the existing end-to-end ASR results on CHiME-6 to demonstrate that the pseudo-labels are sufficiently clean to drive the observed gains over guided source separation. revision: yes

  2. Referee: [§4] §4 (Experimental results): The SOTA ASR numbers on CHiME-6 under estimated diarization lack ablations isolating the contribution of CTRnet pseudo-labels versus raw close-talk signals or simulated-data baselines, and no statistical significance tests or error bars are reported to confirm the gains are robust rather than dataset-specific.

    Authors: We will incorporate the requested ablations in the revised manuscript: (i) PuLSS trained on raw close-talk signals without CTRnet, (ii) PuLSS trained exclusively on simulated data, and (iii) the full CTRnet + PuLSS pipeline. We will also report error bars obtained from multiple independent training runs and include paired statistical significance tests on the WER differences to establish that the improvements are robust. revision: yes

  3. Referee: [§2.2] §2.2 (CTRnet training): The supervision mechanism for training CTRnet on real pairs without clean targets is not fully specified; if the loss relies on far-field mixtures in a way that introduces circular dependence, the pseudo-label step risks reducing to a fitted quantity rather than providing independent supervision.

    Authors: Section 2.2 specifies that CTRnet is trained by minimizing a composite loss comprising a reconstruction term on the close-talk output and a cross-domain consistency term that aligns the estimated wearer speech with the corresponding far-field mixture after accounting for acoustic differences. The far-field signal is used only as an auxiliary reference for the shared speech content and is never employed as a direct target for the close-talk output; therefore the supervision remains independent. We will expand the loss formulation with explicit equations in the revision to remove any remaining ambiguity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; training and pseudo-label steps remain independent of final metrics

full rationale

The derivation chain begins with CTRnet trained directly on real-recorded close-talk/far-field pairs to produce cross-talk-reduced estimates, which are then used as pseudo-labels to train PuLSS for far-field separation. No quoted equations, self-citations, or fitted parameters in the abstract or described framework reduce the reported CHiME-6 ASR gains by construction to the input pairs or to a renamed version of the same supervision signal. The pseudo-label quality assumption is an empirical claim subject to verification on held-out data rather than a definitional loop, and the SOTA result is presented as an observed outcome rather than a statistical necessity from the training setup itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on standard deep-learning assumptions for audio processing plus two new model constructs; no new physical constants or unproven mathematical lemmas are introduced beyond typical neural-network training.

axioms (2)
  • domain assumption Close-talk mixtures contain recoverable information about the wearer's speech that a neural network can isolate from cross-talk.
    Invoked when stating that CTRnet can be trained directly on real close-talk/far-field pairs.
  • domain assumption Pseudo-labels generated by CTRnet are sufficiently accurate to supervise far-field separation training.
    This premise is required for the PuLSS stage to improve rather than degrade performance.
invented entities (2)
  • CTRnet no independent evidence
    purpose: Neural network for cross-talk reduction on close-talk mixtures
    New model introduced to accomplish the CTR task.
  • PuLSS no independent evidence
    purpose: Pseudo-label based far-field speech separation framework
    Overall training pipeline that uses CTRnet outputs.

pith-pipeline@v0.9.0 · 5820 in / 1627 out tokens · 87416 ms · 2026-05-20T01:50:03.816850+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

  1. [1]

    Comon and C

    P. Comon and C. Jutten,Handbook of Blind Source Separation: Inde- pendent component analysis and applications. Academic press, 2010

  2. [2]

    The Cocktail Party Problem,

    J. H. McDermott, “The Cocktail Party Problem,”Current Biology, vol. 19, no. 22, pp. 1024–1027, 2009

  3. [3]

    Supervised Speech Separation Based on Deep Learning: An Overview,

    D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,”IEEE/ACM Trans. Audio, Speech, Lang. Pro- cess., vol. 26, no. 10, pp. 1702–1726, 2018

  4. [4]

    30+ Years of Source Separation Research: Achievements and Future Challenges,

    S. Araki, N. Ito, R. Haeb-Umbach, G. Wichern, Z.-Q. Wang, and Y . Mitsufuji, “30+ Years of Source Separation Research: Achievements and Future Challenges,” inProc. ICASSP, 2025

  5. [5]

    Far-Field Automatic Speech Recognition,

    R. Haeb-Umbach, J. Heymann, L. Drude, S. Watanabe, M. Delcroix, and T. Nakatani, “Far-Field Automatic Speech Recognition,”Proc. IEEE, vol. 109, no. 2, pp. 124–148, 2021

  6. [6]

    Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,

    R. Haeb-Umbach, T. Nakatani, M. Delcroix, C. Boeddeker, and T. Ochiai, “Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,”IEEE Signal Process. Mag., vol. 41, no. 6, pp. 12–23, 2025

  7. [7]

    TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,

    Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 3221–3236, 2023

  8. [8]

    Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,

    W. Zhang, J. Shi, C. Li, S. Watanabe, and Y . Qian, “Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,” inProc. WASPAA, 2021, pp. 146–150

  9. [9]

    Real-M: Towards Speech Separation on Real Mixtures,

    C. Subakan, M. Ravanelli, S. Cornell, and F. Grondin, “Real-M: Towards Speech Separation on Real Mixtures,” inProc. ICASSP, 2022, pp. 6862– 6866

  10. [10]

    Summary of the NOTSOFAR-1 challenge: Highlights and learnings,

    I. Abramovski, A. Vinnikov, S. Shaeret al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,”Comput. Speech Lang., vol. 93, p. 101796, 2025

  11. [11]

    Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,

    S. Cornell, C. Boeddeker, T. Park, H. Huang, D. Raj, M. Wiesner, Y . Masuyama, X. Chang, Z.-Q. Wang, S. Squartini, P. Garcia, and S. Watanabe, “Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,”Comput. Speech Lang., vol. 97, 2026

  12. [12]

    An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,

    Y . Masuyama, X. Chang, W. Zhang, S. Cornell, Z.-Q. Wang, N. Ono, Y . Qian, and S. Watanabe, “An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,” Comput. Speech Lang., vol. 95, p. 101813, 2026

  13. [13]

    The AMI Meeting Corpus: A Pre-Announcement,

    J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemotet al., “The AMI Meeting Corpus: A Pre-Announcement,” inMachine Learning for Multimodal Interaction, 2006, pp. 28–39. 16

  14. [14]

    Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,

    F. Yu, S. Zhang, P. Guo, Y . Fu, Z. Duet al., “Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,” inProc. ICASSP, 2022, pp. 9156–9160

  15. [15]

    The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,

    J. Barker, S. Watanabe, E. Vincentet al., “The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,” in Proc. Interspeech, 2018, pp. 1561–1565

  16. [16]

    The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,

    Z. Wang, S. Wu, H. Chenet al., “The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,” inProc. ICASSP, 2023, pp. 1–5

  17. [17]

    Cross-Talk Reduction,

    Z.-Q. Wang, A. Kumar, and S. Watanabe, “Cross-Talk Reduction,” in Proc. IJCAI, 2024, pp. 5171–5180

  18. [18]

    BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,

    J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,” inProc. ASRU, 2015, pp. 444–451

  19. [19]

    Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,

    H. Erdogan, J. R. Hershey, S. Watanabe, I. Mandel, and J. Le Roux, “Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,” inProc. Interspeech, 2016, pp. 1981–1985

  20. [20]

    How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,

    K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, “How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,”Proc. Interspeech, pp. 5418–5422, 2022

  21. [21]

    VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,

    N. Kanda, J. Wu, X. Wang, Z. Chen, J. Li, and T. Yoshioka, “VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,” inProc. ICASSP, 2023, pp. 1–5

  22. [22]

    Unsupervised Sound Separation using Mixture Invariant Training,

    S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. R. Hershey, “Unsupervised Sound Separation using Mixture Invariant Training,”Proc. NeurIPS, vol. 33, pp. 3846–3857, 2020

  23. [23]

    Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,

    J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,” in Proc. Interspeech, 2021

  24. [24]

    Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,

    K. Saijo and T. Ogawa, “Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,” inProc. ICASSP, 2023, pp. 1–5

  25. [25]

    Unsupervised Multi- Channel Separation and Adaptation,

    C. Han, K. Wilson, S. Wisdom, and J. R. Hershey, “Unsupervised Multi- Channel Separation and Adaptation,” inProc. ICASSP, 2024, pp. 721– 725

  26. [26]

    Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,

    A. Sivaraman, S. Wisdom, H. Erdogan, and J. R. Hershey, “Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,” inProc. ICASSP, 2022, pp. 686–690

  27. [27]

    Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,

    S. Wisdom, A. Jansen, R. J. Weiss, H. Erdogan, and J. R. Hershey, “Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,” inProc. WASPAA, 2021, pp. 51–55

  28. [28]

    UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,

    Z.-Q. Wang and S. Watanabe, “UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,” inProc. NeurIPS, vol. 36, 2023, pp. 34 021–34 042

  29. [29]

    Enhanced Reverberation as Supervision for Unsupervised Speech Separation,

    K. Saijo, G. Wichern, F. G. Germain, Z. Pan, and J. Le Roux, “Enhanced Reverberation as Supervision for Unsupervised Speech Separation,” in Proc. Interspeech, 2024, pp. 607–611

  30. [30]

    Spatial Loss for Unsupervised Multi-channel Source Separation,

    K. Saijo and R. Scheibler, “Spatial Loss for Unsupervised Multi-channel Source Separation,” inProc. Interspeech, 2022, pp. 241–245

  31. [31]

    Front-End Processing for The CHiME-5 Dinner Party Scenario,

    C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Hey- mann, and R. Haeb-Umbach, “Front-End Processing for The CHiME-5 Dinner Party Scenario,” inProc. CHiME, vol. 1, 2018

  32. [32]

    VarArray: Array-Geometry-Agnostic Continuous Speech Separation,

    T. Yoshioka, X. Wang, D. Wang, M. Tang, Z. Zhu, Z. Chen, and N. Kanda, “VarArray: Array-Geometry-Agnostic Continuous Speech Separation,” inProc. ICASSP, 2022, pp. 6027–6031

  33. [33]

    Continuous speech separation: Dataset and analysis,

    Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y . Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” inProc. ICASSP, 2020, pp. 7284–7288

  34. [34]

    NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,

    A. Vinnikov, A. Ivry, A. Hurvitzet al., “NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,” in Proc. Interspeech, 2024

  35. [35]

    Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,

    T. V on Neumann, C. Boeddeker, T. Cord-Landwehr, M. Delcroix, and R. Haeb-Umbach, “Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,” inProc. HSCMA, 2024, pp. 775–779

  36. [36]

    Neural fast full-rank spatial covariance analysis for blind source separation,

    Y . Bando, Y . Masuyama, A. A. Nugraha, and K. Yoshii, “Neural fast full-rank spatial covariance analysis for blind source separation,” inProc. EUSIPCO, 2023, pp. 51–55

  37. [37]

    Neural Blind Source Separation and Diarization for Distant Speech Recognition,

    Y . Bando, T. Nakamura, and S. Watanabe, “Neural Blind Source Separation and Diarization for Distant Speech Recognition,” inProc. Interspeech, 2024, pp. 722–726

  38. [38]

    Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,

    Y . Bando, S. Cornell, S. Fukayama, and S. Watanabe, “Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,” inProc. ICASSP, 2025, pp. 1–5

  39. [39]

    ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,

    Z.-Q. Wang, “ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,”J. Acoust. Soc. Am., vol. 158, no. 4, pp. 2849– 2862, 2025

  40. [40]

    SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,

    ——, “SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,”Neural Networks, vol. 188, no. 107408, pp. 1–16, 2025

  41. [41]

    Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,

    ——, “Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,”IEEE Signal Process. Lett., vol. 31, pp. 1715–1719, 2024

  42. [42]

    SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,

    L. Luo, L. Li, and Q. Hong, “SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,” inProc. Interspeech, 2025, pp. 3404–3408

  43. [43]

    Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,

    L. Luo, S. Lu, L. Li, and Q. Hong, “Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,” in Proc. Interspeech, 2025, pp. 1883–1887

  44. [44]

    Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,

    R. Talmon, I. Cohen, and S. Gannot, “Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,”IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 546–555, 2009

  45. [45]

    A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,

    S. Gannot, E. Vincentet al., “A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 692–730, 2017

  46. [46]

    Understanding Blind Deconvolution Algorithms,

    A. Levin, Y . Weiss, F. Durand, and W. T. Freeman, “Understanding Blind Deconvolution Algorithms,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2354–2367, 2011

  47. [47]

    Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,

    Z.-Q. Wang, G. Wichern, and J. Le Roux, “Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3476–3490, 2021

  48. [48]

    Differentiable Consistency Constraints for Improved Deep Speech Enhancement,

    S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable Consistency Constraints for Improved Deep Speech Enhancement,” inProc. ICASSP, 2019, pp. 900–904

  49. [49]

    Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,

    M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017

  50. [50]

    Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,

    D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y . Luoet al., “Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,” inProc. SLT, 2021, pp. 897–904

  51. [51]

    CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,

    S. Watanabe, M. Mandel, J. Barkeret al., “CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,” in Proc. CHiME, 2020, pp. 1–7

  52. [52]

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,

    S. Cornell, M. S. Wiesner, S. Watanabe, D. Raj, X. Chang, P. Garcia, Y . Masuyam, Z.-Q. Wang, S. Squartini, and S. Khudanpur, “The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,” inProc. CHiME, 2023, pp. 1–6

  53. [53]

    The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,

    S. Cornell, T. J. Park, H. Huang, C. Boeddeker, X. Chang, M. Maciejew- ski, M. S. Wiesner, P. Garcia, and S. Watanabe, “The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,” inProc. CHiME, 2024, pp. 1– 6

  54. [54]

    The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,

    R. Wang, M. He, J. Duet al., “The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,” inProc. CHiME, 2023, pp. 13–18

  55. [55]

    The IACAS- Thinkit System for CHiME-7 Challenge,

    L. Ye, H. Lu, G. Cheng, Y . Chen, Z. Shang, and X. Li, “The IACAS- Thinkit System for CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 23–26

  56. [56]

    Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,

    Z.-Q. Wang, P. Wang, and D. Wang, “Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2001– 2014, 2021

  57. [57]

    Librispeech: An ASR Corpus Based on Public Domain Audio Books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,”Proc. ICASSP, pp. 5206–5210, 2015

  58. [58]

    EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,

    J. Richter, Y . C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” inProc. Interspeech, 2024, pp. 4873–4877

  59. [59]

    FSD50K: An Open Dataset of Human-Labeled Sound Events,

    E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022

  60. [60]

    A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,

    K. Kinoshita, M. Delcroix, S. Gannotet al., “A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,”Eurasip J. Adv. Signal Process., vol. 2016, no. 1, pp. 1–19, 2016

  61. [61]

    Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,

    T. von Neumann, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,”IEEE Trans. Audio, Speech, Lang. Process., 2025. 17

  62. [62]

    arXiv preprint arXiv:2509.14128 , year =

    M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1B-v2 & Parakeet-TDT- 0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST,”arXiv preprint arXiv:2509.14128, 2025

  63. [63]

    arXiv preprint arXiv:1909.09577 , year=

    O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V . Lavrukhin, J. Cooket al., “NeMo: A Toolkit for Building AI Applications using Neural Modules,”arXiv preprint arXiv:1909.09577, 2019

  64. [64]

    BUT CHiME-7 System Description,

    M. Karafiat, K. Vesel ´y, I. Szoke, L. Mosner, K. Benes, M. Witkowski, R. G. Barchi, and L. D. Pepino, “BUT CHiME-7 System Description,” inProc. CHiME, 2023, pp. 67–72

  65. [65]

    The NPU System for DASR Task of CHiME-7 Challenge,

    B. Mu, P. Guo, H. Wang, Y . Li, Y . Li, P. Zhou, W. Chen, and L. Xie, “The NPU System for DASR Task of CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 63–66

  66. [66]

    The University of Cambridge System for the CHiME-7 DASR Task,

    K. Deng, X. Zheng, and P. Woodland, “The University of Cambridge System for the CHiME-7 DASR Task,” inProc. CHiME, 2023, pp. 73– 76

  67. [67]

    NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,

    N. Kamo, N. Tawara, A. Andoet al., “NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 69–74

  68. [68]

    STCON System for the CHiME-8 Challenge,

    A. Mitrofanov, T. Prisyach, T. Timofeevaet al., “STCON System for the CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 13–17

  69. [69]

    The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,

    N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,” inProc. Interspeech, 2019, pp. 978–982

  70. [70]

    pyannote.audio: Neural Building Blocks for Speaker Diarization,

    H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeuxet al., “pyannote.audio: Neural Building Blocks for Speaker Diarization,” inProc. ICASSP, 2020, pp. 7124–7128

  71. [71]

    Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,

    K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,” inProc. ICASSP, 2021, pp. 7198–7202

  72. [72]

    Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,

    Y . Lee, S. Choi, B. Y . Kim, Z. Q. Wang, and S. Watanabe, “Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,” inProc. ICASSP, 2024, pp. 446–450. Zhong-Qiu Wangreceived the B.E. degree in com- puter science and technology from Harbin Institute of Technology, Harbin, China, in2013, and the Ph.D. degree in computer science...