MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

· 2026 · eess.AS · arXiv 2606.09050

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
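
The abstract does not spell out how FRC schedules receptive fields, so the following is only a minimal sketch of the general idea of chunk-wise attention with bounded future context, not the authors' implementation. The function name `chunk_attention_mask`, the per-layer `schedule`, and all numeric values are hypothetical and chosen purely for illustration. The sketch shows one well-known property of stacked local attention: the total lookahead is the sum of the per-layer future spans, so restricting future context to a few layers keeps the overall lookahead bounded.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Build a boolean mask where mask[i][j] is True if query frame i
    may attend to key frame j.

    Frames are grouped into chunks of `chunk_size`. A frame in chunk c can see
    every frame in chunks c - past_chunks through c + future_chunks.
    """
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            visible = query_chunk - past_chunks <= key_chunk <= query_chunk + future_chunks
            row.append(visible)
        mask.append(row)
    return mask


# Hypothetical per-layer schedule of (past_chunks, future_chunks).
# Only the first layer looks one chunk ahead; the rest are causal at chunk level.
schedule = [(2, 1), (2, 0), (2, 0), (2, 0)]

# One mask per decoder layer, for 8 frames grouped into chunks of 2 frames.
masks = [chunk_attention_mask(8, 2, past, future) for past, future in schedule]

# Future context accumulates across stacked layers, so the total lookahead
# (in chunks) is the sum of the per-layer future spans.
lookahead_chunks = sum(future for _, future in schedule)
```

In this illustrative schedule the stack as a whole waits for one future chunk before emitting output. Such a mask could then be passed to an attention layer as a boolean attention mask; how MeanVC 2 actually assigns past and future spans per layer is specified in the paper, not here.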

representative citing papers

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

eess.AS · 2026-06-08 · unverdicted · novelty 4.0

MeanVC 2 introduces future-receptive chunking and a universal timbre token encoder to achieve lower-latency and more robust streaming zero-shot voice conversion than the original MeanVC.
