BUT System Description for CHiME-9 MCoRec Challenge
Pith reviewed 2026-05-07 10:07 UTC · model grok-4.3
The pith
Conditioning a pre-trained ASR model on AV-HuBERT visual features enables accurate target-speaker transcription and grouping in heavily overlapped long-form conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A long-context target-speaker AV-ASR model that conditions a pre-trained Parakeet-v2 ASR on visual representations extracted from a pre-trained AV-HuBERT model can transcribe heavily overlapped audio-visual recordings of parallel conversations in a single decoding pass; combining the resulting transcripts with LLM-estimated topic similarity followed by hierarchical agglomerative clustering produces high-accuracy grouping of participants into conversational clusters.
What carries the argument
The visual-conditioning mechanism that feeds AV-HuBERT representations into the Parakeet-v2 decoder to identify the target speaker amid overlap, paired with LLM-based topic similarity for downstream clustering.
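The paper's conclusion excerpt mentions fusing the modalities "via a gated adapter" without further detail. Below is a minimal sketch of one plausible reading, in which projected AV-HuBERT features are gated into the frozen Parakeet encoder states frame by frame; every module name and dimension is illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class GatedVisualAdapter(nn.Module):
    """Hypothetical gated adapter: injects target-speaker visual features
    into frozen ASR encoder states. All dimensions are illustrative."""

    def __init__(self, d_audio=1024, d_visual=768):
        super().__init__()
        self.proj = nn.Linear(d_visual, d_audio)  # map AV-HuBERT dim to ASR dim
        self.gate = nn.Sequential(                # per-frame, per-channel gate in (0, 1)
            nn.Linear(d_audio + d_visual, d_audio),
            nn.Sigmoid(),
        )

    def forward(self, audio_states, visual_feats):
        # audio_states: (B, T, d_audio) from the frozen Parakeet-v2 encoder
        # visual_feats: (B, T, d_visual) from AV-HuBERT, resampled to T frames
        g = self.gate(torch.cat([audio_states, visual_feats], dim=-1))
        return audio_states + g * self.proj(visual_feats)
```

The resampling of visual features to the audio frame rate is exactly the temporal-alignment step the referee report below flags as unspecified.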
If this is right
- Single-pass processing of long-form multi-talker audio-visual recordings becomes practical without repeated decoding.
- Visual cues from pre-trained models can measurably lower word error rates when audio overlap prevents reliable speaker selection.
- LLM topic similarity followed by agglomerative clustering offers a scalable way to recover conversational groups from transcripts (sketched below, after this list).
- Pre-trained ASR and visual speech models can be combined for conversational analysis tasks without task-specific retraining.
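For the clustering route, here is a minimal sketch under the assumption that the LLM is queried for a pairwise topic-similarity score in [0, 1]; the `llm_similarity` callable, linkage method, and cut threshold are placeholders, not the paper's prompt or hyperparameters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_speakers(transcripts, llm_similarity, threshold=0.5):
    """Group speakers by hierarchical agglomerative clustering over
    LLM-estimated topic similarity. `llm_similarity(a, b)` is a
    placeholder returning a score in [0, 1]."""
    n = len(transcripts)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # convert similarity into a distance for linkage
            dist[i, j] = dist[j, i] = 1.0 - llm_similarity(transcripts[i], transcripts[j])
    Z = linkage(squareform(dist), method="average")  # average linkage assumed
    return fcluster(Z, t=1.0 - threshold, criterion="distance")
```

Cutting the dendrogram at a fixed distance matches "hierarchical agglomerative clustering" as named in the abstract; the BUT system's actual stopping criterion is not stated.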
Where Pith is reading between the lines
- The same conditioning approach could be tested on other pre-trained ASR backbones to check whether the visual gain generalizes beyond Parakeet-v2.
- If the clustering step is replaced by direct speaker embedding comparison, one could measure whether the LLM topic route adds value beyond acoustic diarization.
- Extending the single-pass long-context window further might reveal limits on how much visual context the model can usefully exploit.
Load-bearing premise
Visual representations from a pre-trained AV-HuBERT model can directly condition a pre-trained Parakeet-v2 ASR model for target-speaker identification in long overlapped recordings without further adaptation of either backbone, even though the paper leaves the fusion details unspecified.
What would settle it
Running the same Parakeet-v2 decoder on the MCoRec development set with the visual-conditioning branch removed or replaced by random vectors, and checking whether the WER gain vanishes, i.e., whether performance regresses toward the official 49.9% baseline.
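Operationally the test is a feature-swap ablation. A sketch, assuming a `model.transcribe(audio, visual)` stand-in for the real decoder and the third-party jiwer scorer for WER:

```python
import torch
import jiwer  # third-party WER scorer, assumed available

def ablation_wer(model, dev_set, mode="random"):
    """Decode the dev set with the visual branch ablated and return WER.
    `model.transcribe(audio, visual)` is a stand-in for the real decoder."""
    hyps, refs = [], []
    for audio, visual, reference in dev_set:
        if mode == "random":
            visual = torch.randn_like(visual)  # replace AV-HuBERT features with noise
        elif mode == "zero":
            visual = torch.zeros_like(visual)  # or remove the signal entirely
        hyps.append(model.transcribe(audio, visual))
        refs.append(reference)
    return jiwer.wer(refs, hyps)

# If WER stays near 33.7%, the visual branch is doing little; if it degrades
# toward the 49.9% baseline, the visual conditioning is load-bearing.
```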
read the original abstract
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with large portion of overlapping speech where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily-overlapped parallel conversations, followed by clustering the participants into conversational groups. In this work, we present the BUT system based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model. To cluster participants into conversation groups, we employ Qwen3.5-122B LLM to estimate transcript topic similarity followed by hierarchical agglomerative clustering. On the MCoRec development set, the proposed system achieves 33.7% WER and a clustering F1 score of 0.97, improving over the official baseline by 16.2% WER and 0.15 F1 absolute. On the eval set, our team ranked second, being 0.16% WER and 0.5% F1 worse than the best system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the BUT team's system for the CHiME-9 MCoRec Challenge on multi-talker audio-visual ASR and conversational group clustering. It proposes a long-context target-speaker AV-ASR model that conditions a frozen pre-trained NVIDIA Parakeet-v2 ASR on visual representations extracted from a pre-trained AV-HuBERT model, combined with Qwen3.5-122B LLM-based transcript topic similarity followed by hierarchical agglomerative clustering. Concrete results are given: 33.7% WER and 0.97 clustering F1 on the development set (16.2% WER and 0.15 F1 absolute gains over the official baseline), with a second-place ranking on the evaluation set (0.16% WER and 0.5% F1 behind the top system).
Significance. If the performance numbers hold under scrutiny, the work demonstrates that pre-trained visual embeddings can substantially aid target-speaker selection in heavily overlapped, long-form multi-party recordings where audio alone is ambiguous. The LLM-driven clustering step yields near-perfect F1, suggesting a practical way to recover conversation structure from transcripts. These empirical gains on a held-out challenge set are noteworthy for AV-ASR, but the absence of mechanistic details and ablations reduces the ability to generalize or build upon the approach.
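The clustering F1 of 0.97 is easiest to interrogate with a concrete metric in hand. One common choice is pair-counting F1 over same-conversation speaker pairs, sketched below under that assumption; the official MCoRec scorer may define F1 differently:

```python
from itertools import combinations

def pairwise_f1(pred_labels, true_labels):
    """Pair-counting F1: a speaker pair is a true positive when both
    clusterings place the two speakers in the same conversation.
    Assumed metric, not necessarily the official MCoRec definition."""
    pairs = list(combinations(range(len(true_labels)), 2))
    tp = sum(1 for i, j in pairs
             if pred_labels[i] == pred_labels[j] and true_labels[i] == true_labels[j])
    fp = sum(1 for i, j in pairs
             if pred_labels[i] == pred_labels[j] and true_labels[i] != true_labels[j])
    fn = sum(1 for i, j in pairs
             if pred_labels[i] != pred_labels[j] and true_labels[i] == true_labels[j])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```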
major comments (2)
- [System description / AV-ASR model] The description of the core AV-ASR architecture (conditioning Parakeet-v2 on AV-HuBERT visual features) provides no information on the fusion operation, temporal alignment procedure for long-form multi-speaker video, or any adaptation of the frozen backbones. This detail is load-bearing for the central claim of a 16.2% absolute WER improvement, as the gain cannot be isolated from unmentioned factors such as beam search, LM rescoring, or data augmentation.
- [Experimental results] No ablation studies, error analysis, or component-wise breakdowns are reported on the development set. Without these, it is impossible to attribute the reported WER and F1 gains specifically to the visual conditioning or the long-context single-pass design versus other implementation choices.
minor comments (1)
- [Results] The abstract and results section state concrete dev-set numbers and baseline comparisons but do not reference any tables or figures that would allow direct verification of the clustering F1 computation or WER breakdown by overlap condition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our CHiME-9 MCoRec system description. We address each major comment below and will revise the manuscript accordingly to improve technical detail and interpretability.
read point-by-point responses
- Referee: The description of the core AV-ASR architecture (conditioning Parakeet-v2 on AV-HuBERT visual features) provides no information on the fusion operation, temporal alignment procedure for long-form multi-speaker video, or any adaptation of the frozen backbones. This detail is load-bearing for the central claim of a 16.2% absolute WER improvement, as the gain cannot be isolated from unmentioned factors such as beam search, LM rescoring, or data augmentation.
  Authors: We agree that the current high-level description of the AV-ASR model lacks sufficient implementation details. In the revised manuscript we will expand the relevant section to describe the fusion operation, the temporal alignment procedure for long-form multi-speaker video, and the treatment of the frozen backbones. We will also explicitly document the decoding configuration (including beam search) and confirm that no external LM rescoring or additional data augmentation was used, thereby clarifying the source of the reported gains. revision: yes
- Referee: No ablation studies, error analysis, or component-wise breakdowns are reported on the development set. Without these, it is impossible to attribute the reported WER and F1 gains specifically to the visual conditioning or the long-context single-pass design versus other implementation choices.
  Authors: We acknowledge that the absence of ablations and error analysis limits the ability to isolate component contributions. As a challenge system paper our emphasis was on final performance, but we agree this reduces interpretability. In the revision we will add ablation results on the development set (comparing the full system against audio-only and non-long-context variants) together with a concise error analysis to better attribute the observed WER and F1 improvements. revision: yes
Circularity Check
No circularity: empirical system description with direct measurements on held-out data
full rationale
The paper is a challenge system description that reports measured WER (33.7%) and clustering F1 (0.97) on the MCoRec development set after describing an architecture that conditions a frozen pre-trained Parakeet-v2 ASR on pre-trained AV-HuBERT visual features and applies an LLM-based topic similarity plus agglomerative clustering step. No equations, fitted parameters, or first-principles derivations are presented whose outputs reduce to the inputs by construction. The performance numbers are direct empirical evaluations on challenge data rather than quantities derived from self-referential fits or self-citations. Self-citations, if present, are not load-bearing for any claimed derivation. The results are therefore grounded in external benchmark data rather than in self-referential constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained AV-HuBERT visual representations can be fused with a pre-trained ASR model to resolve speaker identity in overlapped speech.
Reference graph
Works this paper leans on
- [1] Introduction excerpt: "Transcription of multi-party conversations is a fundamental problem in speech processing, complicated by overlapping speech, rapid turn-taking, and reverberant environments [1, 2, 3]. The CHiME-9 MCoRec task [4] presents a particularly challenging instance of this problem: multiple simultaneous conversations occur within the same recording..."
- [2] BUT System Description for CHiME-9 MCoRec Challenge, proposed-system excerpt: "Our system, illustrated in Figure 1, follows the general pipeline of the official challenge baseline [14, 4], with modifications to both transcription and conversation-group clustering. To enable long-form processing, we fill missing face-detection frames with black frames and decode the concatenated audio-visual input in a single pass..."
- [3] Experimental-setup excerpt: "We developed the AV-TS-ASR system using the NeMo toolkit [24]. For the LLM-based clustering pipeline, we utilized the DSPy framework [25] to systematically structure prompts and modularize the workflow. 3.1. Data Simulation: The proposed system is pre-trained in two stages to progressively adapt the newly-introduced parameters. For th..."
- [4] Results excerpt: "We submitted two systems of the same architecture, both pre-trained on the 1500 h simulated dataset but differing in the fine-tuning data. System 1 was fine-tuned on the MCoRec training set and simulated AMI mixtures, while System 2 was fine-tuned on the MCoRec training set only. Both systems used the same clustering. Table 1 presents the main resu..."
- [5] Conclusion excerpt: "We presented a long-form Audio-Visual Target-Speaker ASR system for the CHiME-9 MCoRec task, leveraging the strong pre-trained representations of NVIDIA Parakeet-v2 and AV-HuBERT. By fusing these modalities via a gated adapter and replacing heuristic clustering with an LLM-driven semantic approach, we achieved substantial improvements over t..."
- [6] S. Watanabe et al., “CHiME-6 challenge: Tackling multi-speaker speech recognition for unsegmented recordings,” in 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020, pp. 1–7.
- [7] S. Cornell et al., “The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios,” in 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023), 2023, pp. 1–6.
- [8] A. Vinnikov et al., “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Interspeech 2024, 2024, pp. 5003–5007.
- [9] T.-B. Nguyen et al., “A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,” in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 19502–19506.
- [10] A. Polok et al., “BUT/JHU system description for CHiME-8 NOTSOFAR-1 challenge,” in Proc. CHiME 2024, 2024, pp. 18–22.
- [11] A. Polok et al., “DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,” Computer Speech & Language, p. 101841, 2025.
- [12] S. Niu et al., “The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,” in 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024), 2024, pp. 31–36.
- [13] N. Kamo et al., “Microphone array geometry-independent multi-talker distant ASR: NTT system for DASR task of the CHiME-8 challenge,” Computer Speech & Language, vol. 95, p. 101820, 2026.
- [14] A. Mitrofanov et al., “STCON system for the CHiME-8 challenge,” in 8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024), ISCA, Sept. 2024, pp. 13–17.
- [15] P. Ma et al., “Auto-AVSR: Audio-visual speech recognition with automatic labels,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [16] A. Rouditchenko et al., “Whisper-Flamingo: Integrating visual features into Whisper for audio-visual speech recognition and translation,” in Interspeech 2024, 2024, pp. 2420–2424.
- [17] J. S. Chung et al., “Lip reading sentences in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3444–3453.
- [18] T. Afouras, J. S. Chung, A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” arXiv preprint arXiv:1809.00496, 2018.
- [19] T.-B. Nguyen, N.-Q. Pham, A. Waibel, “Cocktail-party audio-visual speech recognition,” in Interspeech 2025, 2025, pp. 1828–1832.
- [20] M. Sekoyan et al., “Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and high-performance models for multilingual ASR and AST,” arXiv:2509.14128, 2025.
- [21] Y. Wu et al., “Audio-visual multi-talker speech recognition in a cocktail party,” in Interspeech 2021, ISCA, Aug. 2021.
- [22] A. Yang et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
- [23] D. Rekesh et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2023, pp. 1–8.
- [24] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [25] S. Elfwing, E. Uchibe, K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, vol. 107, pp. 3–11, 2018.
- [26] N. Srivastava et al., “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [27] J. L. Ba, J. R. Kiros, G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- [28] H. Xu et al., “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning, PMLR, 2023, pp. 38462–38484.
- [29] O. Kuchaiev et al., “NeMo: A toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
- [30] O. Khattab et al., “DSPy: Compiling declarative language model calls into self-improving pipelines,” 2024.
- [31] J. Cosentino et al., “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
- [32] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction, Springer, 2005, pp. 28–39.
- [33] I. Loshchilov, F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
discussion (0)