
arxiv: 2606.21215 · v1 · pith:QDKOM2IEnew · submitted 2026-06-19 · 📡 eess.AS · cs.AI · cs.SD

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

Pith reviewed 2026-06-26 13:25 UTC · model grok-4.3

classification 📡 eess.AS cs.AI cs.SD
keywords speaker verification, non-verbal vocalizations, mixture of experts, knowledge distillation, contrastive loss, ECAPA-TDNN, Data2Vec, domain adaptation

The pith

A Mixture of Experts module with conditional distillation reduces speaker verification error on non-verbal vocalizations from 38.93% to 22.66% EER while also lowering speech EER from 13.17% to 9.24%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that standard speaker verification models generalize poorly from speech to non-verbal vocalizations and that direct fine-tuning on NVV data erases prior speech accuracy. It combines frozen Data2Vec features inside an ECAPA-TDNN backbone with a Mixture of Experts layer whose routing learns to separate speech and NVV domains. A distillation loss applied only to speech inputs from a pretrained teacher keeps speech performance intact, while a contrastive loss pulls speech and NVV embeddings closer. The resulting model achieves the stated EER reductions across ten NVV types. This matters because modern TTS and voice conversion systems now produce non-verbal sounds whose speaker identity must be checked for consistency.

Core claim

The authors show that a frozen Data2Vec front-end feeding an ECAPA-TDNN, augmented by a Mixture of Experts layer with learned domain-aware routing, plus a conditional distillation loss on speech and a contrastive loss across domains, simultaneously improves cross-domain verification and protects in-domain speech accuracy. On the tested data this yields the measured drops in equal error rate for both speech-NVV pairs and speech alone.
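To make the routing idea concrete, here is a minimal sketch of a softly gated Mixture of Experts layer in plain Python. It is illustrative only: the names `moe_forward`, `gate_weights`, and `experts`, the linear experts, and the dense softmax gate are all assumptions, not the paper's implementation, whose details this page does not reproduce.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_forward(x, gate_weights, experts):
    """Mix expert outputs using gate probabilities computed from the input.

    gate_weights: one row per expert (hypothetical linear router).
    experts: one weight matrix per expert (hypothetical linear experts).
    Returns the mixed output and the routing distribution.
    """
    gates = softmax(matvec(gate_weights, x))
    outputs = [matvec(expert, x) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(outputs[0]))
    ]
    return mixed, gates
```

In a domain-aware setting the hope is that the gate assigns different distributions to speech and NVV inputs, so inspecting the returned `gates` per domain is a natural first check.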

What carries the argument

Mixture of Experts module with learned domain-aware routing, used together with a conditional distillation loss applied only to speech inputs.
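The "conditional" part of the distillation can be sketched as a mask: the student is pulled toward the teacher only on speech inputs, and NVV inputs contribute nothing to this term. The sketch below assumes a cosine-distance penalty between embeddings; the function names and the choice of distance are hypothetical, since this page does not state the paper's exact distillation formula.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def conditional_distillation_loss(student, teacher, is_speech):
    """Average student-teacher distance over speech items only.

    student, teacher: lists of embedding vectors, aligned by index.
    is_speech: list of booleans; NVV items (False) are skipped entirely.
    """
    losses = [
        cosine_distance(s, t)
        for s, t, speech in zip(student, teacher, is_speech)
        if speech
    ]
    # A batch with no speech items contributes no distillation signal.
    return sum(losses) / len(losses) if losses else 0.0
```

Because NVV items are excluded, the student remains free to move NVV embeddings wherever the other losses push them, while speech embeddings stay anchored to the teacher.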

If this is right

  • A single model can now verify speaker identity for both spoken words and non-verbal sounds without separate training runs.
  • Adapting to NVVs no longer requires trading away speech verification accuracy.
  • Expressive TTS and voice conversion outputs can be checked for identity consistency across verbal and non-verbal segments with one system.
  • The same routing and loss combination works across all ten NVV categories examined without per-category models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The routing mechanism could be tested on other audio domain shifts such as accented speech or noisy environments to see whether the same gains appear.
  • Adding more NVV categories beyond the original ten would provide a direct check on whether the reported generalization holds.
  • The framework could be inserted into existing TTS evaluation pipelines to enforce identity consistency as a standard metric.

Load-bearing premise

The domain-aware routing learned inside the Mixture of Experts and the contrastive alignment will continue to work on NVV types and acoustic conditions different from the ten types used in training.
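One way to probe this premise is a leave-one-type-out split, where a single NVV type is excluded from training and used only for testing. The sketch below is a hypothetical illustration: the record layout (`id`, `domain`, `nvv_type`) and the type labels are placeholders, not the dataset's actual schema.

```python
def leave_one_type_out(utterances, held_out_type):
    """Split utterances so one NVV type is never seen during training.

    utterances: list of dicts with keys 'id', 'domain', 'nvv_type'
    (speech items carry nvv_type=None and always stay in training).
    """
    train = [u for u in utterances if u["nvv_type"] != held_out_type]
    test = [u for u in utterances if u["nvv_type"] == held_out_type]
    return train, test


# Placeholder records; the labels are illustrative, not the paper's list.
utterances = [
    {"id": "u1", "domain": "speech", "nvv_type": None},
    {"id": "u2", "domain": "nvv", "nvv_type": "laughter"},
    {"id": "u3", "domain": "nvv", "nvv_type": "coughing"},
]
train, test = leave_one_type_out(utterances, "coughing")
```

Repeating the split once per NVV type and comparing the unseen-type EER with the seen-type EER would show how much of the gain depends on type coverage.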

What would settle it

Running the trained model on a fresh collection of non-verbal vocalization recordings made under different acoustic conditions and measuring whether the speech-NVV EER remains at or below 22.66%.
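For readers who want to run such a check, the equal error rate can be computed from two lists of trial scores. The sketch below uses a simple threshold sweep and reports the point where false acceptance and false rejection rates are closest; the function name and the toy scores are hypothetical, and production toolkits typically interpolate on the ROC curve instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    FRR: fraction of target (same-speaker) trials rejected.
    FAR: fraction of non-target (different-speaker) trials accepted.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where the rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy cosine scores, not taken from the paper.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))  # 0.25
```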

Figures

Figures reproduced from arXiv: 2606.21215 by Hsin-Yen Sung, Huang-Cheng Chou, Hung-yi Lee, Kuan-Yu Chen, Shrikanth Narayanan, Tzu-Chieh Wei, Yi-Cheng Lin.

Figure 1. Overall pipeline of proposed method.
Figure 3. Effect of IR-MoE expert count on verification performance. The 4-expert configuration achieves the best NvS EER, while SvS EER continues to decrease with more experts.
Adjacent text: … with 1024 channels in the convolutional frame-level layers, producing 192-dimensional speaker embeddings. The loss weights are set to λDist = 10, λSupCon = 0.3, λBal = 0.1, λIntra = 0.05, and λInter = 0.05. The temperature parameter τ is s…
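The loss weights quoted with Figure 3 suggest a weighted sum of auxiliary terms. The sketch below shows how such weights would combine; the weight values come from the text above, but the unweighted `base_loss` term, the key names, and the additive form are assumptions, since the paper's full objective is not reproduced on this page.

```python
# Weights as quoted in the text; key names are hypothetical shorthand.
LOSS_WEIGHTS = {
    "dist": 10.0,    # λDist
    "supcon": 0.3,   # λSupCon
    "bal": 0.1,      # λBal
    "intra": 0.05,   # λIntra
    "inter": 0.05,   # λInter
}


def total_loss(base_loss, aux_losses, weights=LOSS_WEIGHTS):
    """Add weighted auxiliary losses to an (assumed) unweighted base loss.

    aux_losses: dict mapping a weight key to that term's scalar value.
    """
    return base_loss + sum(weights[name] * value for name, value in aux_losses.items())
```

With these values the distillation term is weighted two orders of magnitude above the two routing terms, which fits the paper's emphasis on protecting speech accuracy.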
Figure 2. Cosine similarity distributions illustrating the domain gap between speech and vocal bursts. Target speech–vocal burst trials (dashed orange) exhibit substantial overlap with non-target distributions, indicating an increased risk of false rejections in ECAPA-TDNN.
Adjacent text: 3.3. Training Details. The model parameters are optimized using the Adam optimizer with a weight decay of 1 × 10−4. We adopt a cosine annealing …
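The training-details text mentions cosine annealing, and the standard form of that schedule is easy to state in code. This is a generic sketch: the peak learning rate, minimum learning rate, and step count are placeholders, because the source text is cut off before giving them.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` under a single cosine decay from lr_max to lr_min."""
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
schedule = [cosine_annealing_lr(s, 100, lr_max=1e-3) for s in (0, 50, 100)]
```

The rate starts at `lr_max`, passes the midpoint halfway through, and reaches `lr_min` at the final step.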
Original abstract

As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
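The contrastive term can be illustrated with the supervised contrastive loss of Khosla et al., which the paper's reference list includes. The sketch below treats all embeddings of the same speaker, speech or NVV, as positives. It is a plain-Python illustration under assumptions: the temperature of 0.1 is a placeholder, and the paper's exact pairing rules are not shown on this page.

```python
import math


def supcon_loss(embeddings, speaker_ids, temperature=0.1):
    """Supervised contrastive loss over L2-normalized embeddings.

    For each anchor, same-speaker items are positives and every other
    item in the batch appears in the softmax denominator.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and speaker_ids[j] == speaker_ids[i]]
        if not positives:
            continue  # anchors without a positive are skipped
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

When a speaker's speech and NVV embeddings already coincide the loss is near zero, and it grows as same-speaker embeddings drift apart relative to other speakers.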

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents the first systematic study of speaker verification across 10 NVV types. It proposes a framework that combines frozen Data2Vec features with an ECAPA-TDNN backbone, augmented by a Mixture of Experts module employing learned domain-aware routing, a conditional distillation loss from a pretrained teacher on speech inputs, and a contrastive loss to bridge the speech-NVV domain gap. The central empirical claims are EER reductions from 38.93% to 22.66% on speech-NVV and from 13.17% to 9.24% on speech.

Significance. If the reported gains prove robust, the work would be significant for speaker verification in expressive TTS and voice conversion pipelines, where reliable identity consistency across verbal and non-verbal segments is required. The combination of self-supervised features, conditional distillation to avoid catastrophic forgetting, and MoE routing offers a concrete template for domain adaptation in this setting.

major comments (3)
  1. [§5] §5 (experimental results): No evaluation on held-out NVV types, different recording conditions, or out-of-distribution vocalizations is reported. This is load-bearing for the claim that the domain-aware MoE router and contrastive alignment successfully generalize beyond the specific 10 NVV types and conditions used in training.
  2. [§4.3] §4.3 (MoE module description): The domain-aware supervision signal for the routing network is not accompanied by any analysis of routing entropy, expert utilization statistics, or ablation on held-out classes, leaving open whether the router collapses to type-specific experts.
  3. [experimental results tables] Table reporting EER numbers (likely Table 2 or 3): The headline improvements are given as single-point estimates without dataset split details, number of runs, variance across seeds, or statistical significance tests, which undermines assessment of whether the gains over the pretrained baseline are reliable.
minor comments (2)
  1. [Abstract, §3] The abstract and §3 could more explicitly list the 10 NVV types and the source datasets used for training and evaluation.
  2. [§4.4] Notation for the contrastive loss temperature and distillation weight should be introduced once and used consistently across equations and text.
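The routing diagnostics requested in major comment 2 are inexpensive to compute once gate probabilities are logged. The sketch below assumes each input yields a probability vector over experts; the function names are hypothetical and nothing here comes from the paper's code.

```python
import math


def routing_entropy(gates):
    """Shannon entropy (in nats) of one routing distribution."""
    return -sum(p * math.log(p) for p in gates if p > 0)


def expert_utilization(batch_gates):
    """Mean gate probability per expert over a batch of routing distributions."""
    count = len(batch_gates)
    num_experts = len(batch_gates[0])
    return [sum(g[k] for g in batch_gates) / count for k in range(num_experts)]
```

Near-zero entropy on every input combined with utilization concentrated on one expert per NVV type would point to the type-specific collapse the referee describes, whereas similar routing across NVV types and distinct routing for speech would support the domain-aware reading.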

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (experimental results): No evaluation on held-out NVV types, different recording conditions, or out-of-distribution vocalizations is reported. This is load-bearing for the claim that the domain-aware MoE router and contrastive alignment successfully generalize beyond the specific 10 NVV types and conditions used in training.

    Authors: We agree that evaluations on held-out NVV types would provide stronger support for generalization claims. The current study systematically evaluates the 10 NVV types present in the dataset. In the revised manuscript, we will add an ablation experiment that holds out one NVV type during training and reports performance on the unseen type to directly test the router and alignment components. We will also explicitly discuss the scope and limitations regarding recording conditions and OOD vocalizations. revision: yes

  2. Referee: [§4.3] §4.3 (MoE module description): The domain-aware supervision signal for the routing network is not accompanied by any analysis of routing entropy, expert utilization statistics, or ablation on held-out classes, leaving open whether the router collapses to type-specific experts.

    Authors: We concur that routing analysis is needed to substantiate the MoE design. The revised manuscript will include new figures and text in §4.3 reporting routing entropy, per-expert utilization statistics across the 10 NVV types, and an ablation that holds out classes during training to demonstrate that the router learns domain-aware rather than purely type-specific behavior. revision: yes

  3. Referee: [experimental results tables] Table reporting EER numbers (likely Table 2 or 3): The headline improvements are given as single-point estimates without dataset split details, number of runs, variance across seeds, or statistical significance tests, which undermines assessment of whether the gains over the pretrained baseline are reliable.

    Authors: We appreciate this observation on reproducibility. The reported EERs reflect a single training run. In the revision we will add dataset split details, report means and standard deviations over multiple random seeds, and include statistical significance tests comparing our method against the baseline. revision: yes
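The multi-seed reporting promised in response 3 could look like the following sketch, which summarizes per-seed EERs with a mean, a sample standard deviation, and a percentile bootstrap interval on the mean. The function names and the example numbers are hypothetical placeholders, not results from the paper.

```python
import random
import statistics


def summarize_runs(eers):
    """Return (mean, sample standard deviation) of per-seed EER values."""
    return statistics.mean(eers), statistics.stdev(eers)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder EERs for three seeds; not values reported by the authors.
seed_eers = [0.20, 0.22, 0.24]
mean_eer, std_eer = summarize_runs(seed_eers)
ci_low, ci_high = bootstrap_ci(seed_eers)
```

With only a handful of seeds the interval is coarse, so a bootstrap over evaluation trials within each run is a common complement.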

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured against external baselines and pretrained teacher.

full rationale

The paper describes a standard ML pipeline: frozen Data2Vec features + ECAPA-TDNN backbone, MoE with domain-aware routing, conditional distillation from a pretrained teacher, and contrastive loss. Reported EER reductions (speech-NVV 38.93%→22.66%, speech 13.17%→9.24%) are obtained by direct comparison to an external pretrained baseline. No equations, fitted parameters, or self-citations are shown to reduce these quantities to quantities defined inside the paper. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions that frozen self-supervised features transfer to NVVs and that a contrastive loss can bridge the domain gap; no new entities are postulated.

free parameters (2)
  • MoE routing network weights
    Learned parameters that decide which expert processes each input domain.
  • Distillation temperature or loss weight
    Hyperparameter balancing speech retention against NVV adaptation.
axioms (2)
  • domain assumption Frozen Data2Vec features remain informative for non-verbal vocalizations
    The method freezes the feature extractor and assumes its representations generalize without fine-tuning.
  • domain assumption The 10 NVV types studied are representative of real-world non-verbal vocalizations
    Generalization claim depends on this coverage assumption.

pith-pipeline@v0.9.1-grok · 5739 in / 1352 out tokens · 26692 ms · 2026-06-26T13:25:21.717296+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Introduction. Modern Text-to-Speech (TTS) [1, 2, 3, 4] and Voice Conversion (VC) [5] systems achieve near-human naturalness for modal speech, but generating realistic non-verbal vocalizations (NVVs) such as laughter, coughing, and breathing remains an open challenge. Since these generated sounds must maintain speaker identity, objective evaluation deman...

  2. [2]

    Methodology. As illustrated in Figure 1, our proposed framework comprises three core components: (1) a frozen Data2Vec and ECAPA-TDNN backbone; (2) an MoE module for domain-aware feature routing; and (3) a multi-objective training strategy integrating conditional distillation, contrastive bridging, and domain-aware routing constraints. 2.1. Backbone Archi...

  3. [3]

    Database and Pre-processing. To systematically investigate the domain mismatch in SV models, we utilize the NonverbalTTS [21] dataset for our experimental evaluations

    Experiment. 3.1. Database and Pre-processing. To systematically investigate the domain mismatch in SV models, we utilize the NonverbalTTS [21] dataset for our experimental evaluations. It comprises 17 hours of audio split into training (1,314 speakers), validation (46 speakers), and test (147 speakers) partitions, with no speaker overlap across splits. ...

  4. [4]

    Specifically, each batch is composed of 16 unique speakers, where each speaker contributes exactly 8 utterances (a mix of speech and NVV utterances). Furthermore, during the loss computation, our implementation explicitly prioritizes cross-domain positive pairs (same speaker, different modality) and falls back to intra-domain positive pairs (same speak...

  5. [5]

    Zero-shot

    Results and Analyses. This section systematically evaluates the performance of standard SV models alongside our proposed framework. Specifically, we first quantify the domain mismatch that degrades zero-shot verification performance on non-verbal vocalizations. We then analyze the transfer learning dilemma and demonstrate how our conditional distillati...

  6. [6]

    We demonstrate that standard SV models suffer severe acoustic mismatch and catastrophic forgetting when adapted to NVVs

    Conclusion and Future Work. This paper addresses a critical blind spot in current speech research by presenting the first systematic study of speaker identity verification across 10 distinct NVV types. We demonstrate that standard SV models suffer severe acoustic mismatch and catastrophic forgetting when adapted to NVVs. To overcome this, we proposed a...

  7. [7]

    Acknowledgments. We acknowledge the National Center for High-Performance Computing (NCHC) of the National Institutes of Applied Research (NIAR) in Taiwan for providing computing resources. Additionally, this work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence, through...

  8. [8]

    The authors remain solely responsible for the research design, experiments, analysis, and reported results

    Generative AI Use Disclosure. Generative AI tools assisted in polishing the manuscript’s language. The authors remain solely responsible for the research design, experiments, analysis, and reported results. AI tools did not contribute to the substantive scientific content

  9. [9]

    NaturalSpeech 3: zero-shot speech synthesis with factorized codec and diffusion models,

    Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang, Z. Wu, T. Qin, X.-Y. Li, W. Ye, S. Zhang, J. Bian, L. He, J. Li, and S. Zhao, “NaturalSpeech 3: zero-shot speech synthesis with factorized codec and diffusion models,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

  10. [10]

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi, K. An, G. Yang, Y. Li, Y. Chen, Z. Gao, Q. Chen, Y. Gu, M. Chen, Y. Chen, S. Zhang, W. Wang, and J. Ye, “CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training,” 2025. [Online]. Available: https://arxiv.org/abs/2505.17589

  11. [11]

    Breezyvoice: Adapting tts for taiwanese mandarin with enhanced polyphone disambiguation – challenges and insights,

    C.-J. Hsu, Y.-C. Lin, C.-C. Lin, W.-C. Chen, H. L. Chung, C.-A. Li, Y.-C. Chen, C.-Y. Yu, M.-J. Lee, C.-C. Chen, R.-H. Huang, H. yi Lee, and D.-S. Shiu, “Breezyvoice: Adapting tts for taiwanese mandarin with enhanced polyphone disambiguation – challenges and insights,” 2025. [Online]. Available: https://arxiv.org/abs/2501.17790

  12. [12]

    Emotion-aligned generation in diffusion text to speech models via preference-guided optimization,

    J. Shi, H. Du, Y. He, Y. A. Hong, and Y. Gao, “Emotion-aligned generation in diffusion text to speech models via preference-guided optimization,” in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 16492–16496

  13. [13]

    Zero-shot voice conversion with diffusion transformers,

    S. Liu, “Zero-shot Voice Conversion with Diffusion Transformers,” 2024. [Online]. Available: https://arxiv.org/abs/2411.09943

  14. [14]

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  15. [15]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

  16. [16]

    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Interspeech 2020, 2020, pp. 3830–3834

  17. [17]

    VoxCeleb: A Large-Scale Speaker Identification Dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Interspeech 2017, 2017, pp. 2616–2620

  18. [18]

    Speaker recognition with cough, laugh and ‘Wei’,

    M. Zhang, Y. Chen, L. Li, and D. Wang, “Speaker recognition with cough, laugh and ‘Wei’,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 497–501

  19. [19]

    Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments,

    Y. Lin, X. Qin, H. Cui, Z. Zhu, and M. Li, “Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments,” 2023. [Online]. Available: https://arxiv.org/abs/2210.16028

  20. [20]

    Perceptual cues in nonverbal vocal expressions of emotion,

    D. A. Sauter, F. Eisner, A. J. Calder, and S. K. Scott, “Perceptual cues in nonverbal vocal expressions of emotion,” Quarterly journal of experimental psychology, vol. 63, no. 11, pp. 2251–2272, 2010

  21. [21]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017

  22. [22]

    Learning without forgetting,

    Z. Li and D. Hoiem, “Learning without forgetting,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2935–2947, 2017

  23. [23]

    data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,

    A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 16...

  24. [24]

    Arcface: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699

  25. [25]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

  26. [26]

    Bootstrap your own latent-a new approach to self-supervised learning,

    J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent-a new approach to self-supervised learning,” Advances in neural information processing systems, vol. 33, pp. 21271–21284, 2020

  27. [27]

    Supervised contrastive learning,

    P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in neural information processing systems, vol. 33, pp. 18661–18673, 2020

  28. [28]

    Improving speaker representations using contrastive losses on multi-scale features,

    S. Dixit, M. Baali, R. Singh, and B. Raj, “Improving speaker representations using contrastive losses on multi-scale features,” arXiv preprint arXiv:2410.05037, 2024

  29. [29]

    Nonverbaltts: A public english corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech,

    M. Borisov, E. Spirin, and D. Diatlova, “Nonverbaltts: A public english corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech,” arXiv preprint arXiv:2507.13155, 2025

  30. [30]

    Montreal forced aligner: Trainable text-speech alignment using kaldi

    M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi,” in Interspeech, vol. 2017, 2017, pp. 498–502

  31. [31]

    voc2vec: A Foundation Model for Non-Verbal Vocalization,

    A. Koudounas, M. La Quatra, S. M. Siniscalchi, and E. Baralis, “voc2vec: A Foundation Model for Non-Verbal Vocalization,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  32. [32]

    Confidence intervals for evaluation in machine learning

    L. Ferrer and P. Riera, “Confidence intervals for evaluation in machine learning.” [Online]. Available: https://github.com/luferrer/ConfidenceIntervals