BUT System Description for CHiME-9 MCoRec Challenge
Pith reviewed 2026-05-07 10:07 UTC · model grok-4.3
The pith
Conditioning a pre-trained ASR model on AV-HuBERT visual features enables accurate target-speaker transcription and grouping in heavily overlapped long-form conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A long-context target-speaker AV-ASR model that conditions a pre-trained Parakeet-v2 ASR on visual representations extracted from a pre-trained AV-HuBERT model can transcribe heavily overlapped audio-visual recordings of parallel conversations in a single decoding pass; combining the resulting transcripts with LLM-estimated topic similarity followed by hierarchical agglomerative clustering produces high-accuracy grouping of participants into conversational clusters.
What carries the argument
The visual-conditioning mechanism that feeds AV-HuBERT representations into the Parakeet-v2 decoder to identify the target speaker amid overlap, paired with LLM-based topic similarity for downstream clustering.
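The paper's conclusion excerpt mentions fusing the modalities "via a gated adapter" without further detail. Below is a minimal sketch of one plausible reading, in which projected AV-HuBERT features are gated into the frozen Parakeet encoder states frame by frame; every module name and dimension is illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class GatedVisualAdapter(nn.Module):
    """Hypothetical gated adapter: injects target-speaker visual features
    into frozen ASR encoder states. All dimensions are illustrative."""

    def __init__(self, d_audio=1024, d_visual=768):
        super().__init__()
        self.proj = nn.Linear(d_visual, d_audio)  # map AV-HuBERT dim to ASR dim
        self.gate = nn.Sequential(                # per-frame, per-channel gate in (0, 1)
            nn.Linear(d_audio + d_visual, d_audio),
            nn.Sigmoid(),
        )

    def forward(self, audio_states, visual_feats):
        # audio_states: (B, T, d_audio) from the frozen Parakeet-v2 encoder
        # visual_feats: (B, T, d_visual) from AV-HuBERT, resampled to T frames
        g = self.gate(torch.cat([audio_states, visual_feats], dim=-1))
        return audio_states + g * self.proj(visual_feats)
```

The resampling of visual features to the audio frame rate is exactly the temporal-alignment step the referee report below flags as unspecified.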
If this is right
- Single-pass processing of long-form multi-talker audio-visual recordings becomes practical without repeated decoding.
- Visual cues from pre-trained models can measurably lower word error rates when audio overlap prevents reliable speaker selection.
- LLM topic similarity followed by agglomerative clustering offers a scalable way to recover conversational groups from transcripts (sketched below, after this list).
- Pre-trained ASR and visual speech models can be combined for conversational analysis tasks without task-specific retraining.
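For the clustering route, here is a minimal sketch under the assumption that the LLM is queried for a pairwise topic-similarity score in [0, 1]; the `llm_similarity` callable, linkage method, and cut threshold are placeholders, not the paper's prompt or hyperparameters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_speakers(transcripts, llm_similarity, threshold=0.5):
    """Group speakers by hierarchical agglomerative clustering over
    LLM-estimated topic similarity. `llm_similarity(a, b)` is a
    placeholder returning a score in [0, 1]."""
    n = len(transcripts)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # convert similarity into a distance for linkage
            dist[i, j] = dist[j, i] = 1.0 - llm_similarity(transcripts[i], transcripts[j])
    Z = linkage(squareform(dist), method="average")  # average linkage assumed
    return fcluster(Z, t=1.0 - threshold, criterion="distance")
```

Cutting the dendrogram at a fixed distance matches "hierarchical agglomerative clustering" as named in the abstract; the BUT system's actual stopping criterion is not stated.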
Where Pith is reading between the lines
- The same conditioning approach could be tested on other pre-trained ASR backbones to check whether the visual gain generalizes beyond Parakeet-v2.
- If the clustering step is replaced by direct speaker embedding comparison, one could measure whether the LLM topic route adds value beyond acoustic diarization.
- Extending the single-pass long-context window further might reveal limits on how much visual context the model can usefully exploit.
Load-bearing premise
Visual representations from a pre-trained AV-HuBERT model can directly condition a pre-trained Parakeet-v2 ASR model for target-speaker identification in long overlapped recordings without further adaptation of either backbone, even though the paper leaves the fusion details unspecified.
What would settle it
Running the same Parakeet-v2 decoder on the MCoRec development set with the visual-conditioning branch removed or replaced by random vectors, and checking whether the WER gain vanishes, i.e., whether performance regresses toward the official 49.9% baseline.
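Operationally the test is a feature-swap ablation. A sketch, assuming a `model.transcribe(audio, visual)` stand-in for the real decoder and the third-party jiwer scorer for WER:

```python
import torch
import jiwer  # third-party WER scorer, assumed available

def ablation_wer(model, dev_set, mode="random"):
    """Decode the dev set with the visual branch ablated and return WER.
    `model.transcribe(audio, visual)` is a stand-in for the real decoder."""
    hyps, refs = [], []
    for audio, visual, reference in dev_set:
        if mode == "random":
            visual = torch.randn_like(visual)  # replace AV-HuBERT features with noise
        elif mode == "zero":
            visual = torch.zeros_like(visual)  # or remove the signal entirely
        hyps.append(model.transcribe(audio, visual))
        refs.append(reference)
    return jiwer.wer(refs, hyps)

# If WER stays near 33.7%, the visual branch is doing little; if it degrades
# toward the 49.9% baseline, the visual conditioning is load-bearing.
```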
read the original abstract
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with large portion of overlapping speech where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily-overlapped parallel conversations, followed by clustering the participants into conversational groups. In this work, we present the BUT system based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model. To cluster participants into conversation groups, we employ Qwen3.5-122B LLM to estimate transcript topic similarity followed by hierarchical agglomerative clustering. On the MCoRec development set, the proposed system achieves 33.7% WER and a clustering F1 score of 0.97, improving over the official baseline by 16.2% WER and 0.15 F1 absolute. On the eval set, our team ranked second, being 0.16% WER and 0.5% F1 worse than the best system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the BUT team's system for the CHiME-9 MCoRec Challenge on multi-talker audio-visual ASR and conversational group clustering. It proposes a long-context target-speaker AV-ASR model that conditions a frozen pre-trained NVIDIA Parakeet-v2 ASR on visual representations extracted from a pre-trained AV-HuBERT model, combined with Qwen3.5-122B LLM-based transcript topic similarity followed by hierarchical agglomerative clustering. Concrete results are given: 33.7% WER and 0.97 clustering F1 on the development set (16.2% WER and 0.15 F1 absolute gains over the official baseline), with a second-place ranking on the evaluation set (0.16% WER and 0.5% F1 behind the top system).
Significance. If the performance numbers hold under scrutiny, the work demonstrates that pre-trained visual embeddings can substantially aid target-speaker selection in heavily overlapped, long-form multi-party recordings where audio alone is ambiguous. The LLM-driven clustering step yields near-perfect F1, suggesting a practical way to recover conversation structure from transcripts. These empirical gains on a held-out challenge set are noteworthy for AV-ASR, but the absence of mechanistic details and ablations reduces the ability to generalize or build upon the approach.
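The clustering F1 of 0.97 is easiest to interrogate with a concrete metric in hand. One common choice is pair-counting F1 over same-conversation speaker pairs, sketched below under that assumption; the official MCoRec scorer may define F1 differently:

```python
from itertools import combinations

def pairwise_f1(pred_labels, true_labels):
    """Pair-counting F1: a speaker pair is a true positive when both
    clusterings place the two speakers in the same conversation.
    Assumed metric, not necessarily the official MCoRec definition."""
    pairs = list(combinations(range(len(true_labels)), 2))
    tp = sum(1 for i, j in pairs
             if pred_labels[i] == pred_labels[j] and true_labels[i] == true_labels[j])
    fp = sum(1 for i, j in pairs
             if pred_labels[i] == pred_labels[j] and true_labels[i] != true_labels[j])
    fn = sum(1 for i, j in pairs
             if pred_labels[i] != pred_labels[j] and true_labels[i] == true_labels[j])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```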
major comments (2)
- [System description / AV-ASR model] The description of the core AV-ASR architecture (conditioning Parakeet-v2 on AV-HuBERT visual features) provides no information on the fusion operation, temporal alignment procedure for long-form multi-speaker video, or any adaptation of the frozen backbones. This detail is load-bearing for the central claim of a 16.2% absolute WER improvement, as the gain cannot be isolated from unmentioned factors such as beam search, LM rescoring, or data augmentation.
- [Experimental results] No ablation studies, error analysis, or component-wise breakdowns are reported on the development set. Without these, it is impossible to attribute the reported WER and F1 gains specifically to the visual conditioning or the long-context single-pass design versus other implementation choices.
minor comments (1)
- [Results] The abstract and results section state concrete dev-set numbers and baseline comparisons but do not reference any tables or figures that would allow direct verification of the clustering F1 computation or WER breakdown by overlap condition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our CHiME-9 MCoRec system description. We address each major comment below and will revise the manuscript accordingly to improve technical detail and interpretability.
read point-by-point responses
- Referee: The description of the core AV-ASR architecture (conditioning Parakeet-v2 on AV-HuBERT visual features) provides no information on the fusion operation, temporal alignment procedure for long-form multi-speaker video, or any adaptation of the frozen backbones. This detail is load-bearing for the central claim of a 16.2% absolute WER improvement, as the gain cannot be isolated from unmentioned factors such as beam search, LM rescoring, or data augmentation.
  Authors: We agree that the current high-level description of the AV-ASR model lacks sufficient implementation details. In the revised manuscript we will expand the relevant section to describe the fusion operation, the temporal alignment procedure for long-form multi-speaker video, and the treatment of the frozen backbones. We will also explicitly document the decoding configuration (including beam search) and confirm that no external LM rescoring or additional data augmentation was used, thereby clarifying the source of the reported gains. revision: yes
- Referee: No ablation studies, error analysis, or component-wise breakdowns are reported on the development set. Without these, it is impossible to attribute the reported WER and F1 gains specifically to the visual conditioning or the long-context single-pass design versus other implementation choices.
  Authors: We acknowledge that the absence of ablations and error analysis limits the ability to isolate component contributions. As a challenge system paper our emphasis was on final performance, but we agree this reduces interpretability. In the revision we will add ablation results on the development set (comparing the full system against audio-only and non-long-context variants) together with a concise error analysis to better attribute the observed WER and F1 improvements. revision: yes
Circularity Check
No circularity: empirical system description with direct measurements on held-out data
full rationale
The paper is a challenge system description that reports measured WER (33.7%) and clustering F1 (0.97) on the MCoRec development set after describing an architecture that conditions a frozen pre-trained Parakeet-v2 ASR on pre-trained AV-HuBERT visual features and applies an LLM-based topic similarity plus agglomerative clustering step. No equations, fitted parameters, or first-principles derivations are presented whose outputs reduce to the inputs by construction. The performance numbers are direct empirical evaluations on challenge data rather than quantities derived from self-referential fits or self-citations. Self-citations, if present, are not load-bearing for any claimed derivation. The results are therefore grounded in external benchmark data rather than in self-referential constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained AV-HuBERT visual representations can be fused with a pre-trained ASR model to resolve speaker identity in overlapped speech.
Reference graph
Works this paper leans on
- [1] Introduction excerpt: "Transcription of multi-party conversations is a fundamental problem in speech processing, complicated by overlapping speech, rapid turn-taking, and reverberant environments [1, 2, 3]. The CHiME-9 MCoRec task [4] presents a particularly challenging instance of this problem: multiple simultaneous conversations occur within the same recording..."
- [2] BUT System Description for CHiME-9 MCoRec Challenge, proposed-system excerpt: "Our system, illustrated in Figure 1, follows the general pipeline of the official challenge baseline [14, 4], with modifications to both transcription and conversation-group clustering. To enable long-form processing, we fill missing face-detection frames with black frames and decode the concatenated audio-visual input in a single pass..."
- [3] Experimental-setup excerpt: "We developed the AV-TS-ASR system using the NeMo toolkit [24]. For the LLM-based clustering pipeline, we utilized the DSPy framework [25] to systematically structure prompts and modularize the workflow. 3.1. Data Simulation: The proposed system is pre-trained in two stages to progressively adapt the newly-introduced parameters. For th..."
- [4] Results excerpt: "We submitted two systems of the same architecture, both pre-trained on the 1500 h simulated dataset but differing in the fine-tuning data. System 1 was fine-tuned on the MCoRec training set and simulated AMI mixtures, while System 2 was fine-tuned on the MCoRec training set only. Both systems used the same clustering. Table 1 presents the main resu..."
- [5] Conclusion excerpt: "We presented a long-form Audio-Visual Target-Speaker ASR system for the CHiME-9 MCoRec task, leveraging the strong pre-trained representations of NVIDIA Parakeet-v2 and AV-HuBERT. By fusing these modalities via a gated adapter and replacing heuristic clustering with an LLM-driven semantic approach, we achieved substantial improvements over t..."
- [6] S. Watanabe et al., “CHiME-6 challenge: Tackling multi-speaker speech recognition for unsegmented recordings,” in 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020, pp. 1–7.
- [7] S. Cornell et al., “The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios,” in 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), 2023, pp. 1–6.
- [8] A. Vinnikov et al., “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Interspeech 2024, 2024, pp. 5003–5007.
- [9] T.-B. Nguyen et al., “A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,” in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 19502–19506.
- [10] A. Polok et al., “BUT/JHU system description for CHiME-8 NOTSOFAR-1 challenge,” in Proc. CHiME 2024, 2024, pp. 18–22.
- [11] A. Polok et al., “DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,” Computer Speech & Language, p. 101841, 2025.
- [12] S. Niu et al., “The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,” in 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024), 2024, pp. 31–36.
- [13] N. Kamo et al., “Microphone array geometry-independent multi-talker distant ASR: NTT system for DASR task of the CHiME-8 challenge,” Computer Speech & Language, vol. 95, p. 101820, 2026.
- [14] A. Mitrofanov et al., “STCON system for the CHiME-8 challenge,” in 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024), ISCA, Sept. 2024, pp. 13–17.
- [15] P. Ma et al., “Auto-AVSR: Audio-visual speech recognition with automatic labels,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [16] A. Rouditchenko et al., “Whisper-Flamingo: Integrating visual features into Whisper for audio-visual speech recognition and translation,” in Interspeech 2024, 2024, pp. 2420–2424.
- [17] J. S. Chung et al., “Lip reading sentences in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3444–3453.
- [18] T. Afouras, J. S. Chung, A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” arXiv preprint arXiv:1809.00496, 2018.
- [19] T.-B. Nguyen, N.-Q. Pham, A. Waibel, “Cocktail-party audio-visual speech recognition,” in Interspeech 2025, 2025, pp. 1828–1832.
- [20] M. Sekoyan et al., “Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and high-performance models for multilingual ASR and AST,” arXiv:2509.14128, 2025.
- [21] Y. Wu et al., “Audio-visual multi-talker speech recognition in a cocktail party,” in Interspeech 2021, ISCA, Aug. 2021.
- [22] A. Yang et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
- [23] D. Rekesh et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2023, pp. 1–8.
- [24] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [25] S. Elfwing, E. Uchibe, K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, vol. 107, pp. 3–11, 2018.
- [26] N. Srivastava et al., “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [27] J. L. Ba, J. R. Kiros, G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- [28] H. Xu et al., “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning, PMLR, 2023, pp. 38462–38484.
- [29] O. Kuchaiev et al., “NeMo: A toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
- [30] O. Khattab et al., “DSPy: Compiling declarative language model calls into self-improving pipelines,” 2024.
- [31] J. Cosentino et al., “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
- [32] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction, Springer, 2005, pp. 28–39.
- [33] I. Loshchilov, F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
discussion (0)