GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

Azmat Adnan; Boon Siew Han; Eng Siong Chng; Haoyang Li; Shreyas Gopal; Wei Rao; Xuyi Zhuang; Ye Ni; Yuanjin Zheng

arxiv: 2512.20978 · v2 · pith:OMTSAE5Tnew · submitted 2025-12-24 · 📡 eess.AS · cs.AI · cs.LG

Haoyang Li, Xuyi Zhuang, Azmat Adnan, Ye Ni, Wei Rao, Shreyas Gopal, Eng Siong Chng, Boon Siew Han, Yuanjin Zheng

keywords generative · gentse · speech · tokens · language · model · offering · reduce

Language Model (LM)-based generative modeling has emerged as a promising direction for Target Speaker Extraction (TSE), offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous self-supervised learning (SSL) or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further apply Direct Preference Optimization (DPO) to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
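
The abstract does not say how the preference-alignment step is configured. As a rough illustration only, the standard DPO objective for a single preference pair can be sketched as below. The function name, the use of scalar sequence log-probabilities, and the default beta of 0.1 are assumptions made for this sketch and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full token sequence under
    the trainable policy and under a frozen reference model.
    All names and the beta default are illustrative assumptions.
    """
    # How much more the policy favors the chosen sequence over the rejected
    # one, measured relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities for one preference pair.
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-12.0,
                ref_chosen_logp=-11.0, ref_rejected_logp=-11.0)
print(f"DPO loss: {loss:.4f}")
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred output.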

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech
cs.SD · 2026-06 · unverdicted · novelty 6.0

A code-mixing guided preference-learning method for TTS produces synthetic data that lowers mixed error rate when fine-tuning Whisper on the SEAME Mandarin-English corpus.