Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning

Alexander Polok; Jan Černocký; Lukáš Burget; Samuele Cornell; Sathvik Udupa; Shinji Watanabe

arxiv: 2606.18134 · v1 · pith:LNDSXX4Tnew · submitted 2026-06-16 · 📡 eess.AS

Alexander Polok, Samuele Cornell, Sathvik Udupa, Jan Černocký, Shinji Watanabe, Lukáš Burget

classification 📡 eess.AS

keywords diarization · dixtral · gemini · voxtral · audio · decoder · encoder · far-field

We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract target-speaker representations, keeping the decoder frozen. We instantiate this as Dixtral, integrating a Diarization Conditioned Whisper (DiCoW) encoder into the Voxtral SLM. On AMI, NOTSOFAR-1, LibriSpeechMix, and Mixer6, Dixtral outperforms Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2 on speaker-attributed transcription by 29.0%, 19.8%, and 16.0% absolute cpWER respectively. On a novel long-form multi-speaker QA benchmark, zero-shot Dixtral matches Gemini on far-field content understanding, and when fine-tuned surpasses both Gemini and Voxtral operating on close-talk across all tasks.
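
The speaker-attributed results above are reported in cpWER (concatenated minimum-permutation word error rate). As a rough illustration of what that metric measures, the sketch below follows the standard definition: concatenate each speaker's utterances, try every mapping between reference and hypothesis speakers, and keep the mapping with the fewest word errors. This is not the authors' evaluation code. The function names `word_errors` and `cp_wer` and the dictionary input format are assumptions made for this example, and it does no text normalization.

```python
from itertools import permutations


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """cpWER sketch. Inputs (assumed format): dict speaker -> list of utterances."""
    # Concatenate each speaker's utterances into one word stream.
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Pad the shorter side with empty streams so every speaker has a partner.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Brute-force search over speaker permutations (fine for a few speakers).
    best = min(
        sum(word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words


ref = {"A": ["hello there", "how are you"], "B": ["fine thanks"]}
hyp = {"spk1": ["fine thanks"], "spk2": ["hello there how you"]}
score = cp_wer(ref, hyp)  # one deleted word out of seven reference words
```

The search over permutations grows factorially with the number of speakers, so this form only suits small meetings. The score is independent of how the hypothesis labels its speakers, which is why swapped labels in the example are not penalized.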
