DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis
Pith reviewed 2026-05-24 19:03 UTC · model grok-4.3
The pith
DNN speaker embeddings trained to match crowdsourced human similarity scores correlate with perception and raise synthesis quality for unseen speakers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that DNN-based speaker embedding models trained either to predict a vector of the subjective inter-speaker similarity matrix or to minimize the squared Frobenius norm between that matrix and the Gram matrix of the d-vectors learn embeddings that are highly correlated with subjective similarity, and that the similarity-vector approach improves the naturalness of DNN-based multi-speaker speech synthesis for open speakers whose utterances were absent from the training set.
What carries the argument
The crowdsourced inter-speaker similarity matrix, used either as direct regression targets for similarity-vector embedding or as the reference for Gram-matrix matching in similarity-matrix embedding.
If this is right
- The similarity-vector embedding can be substituted for d-vectors inside existing multi-speaker DNN synthesizers.
- Training targets derived from human similarity judgments yield embeddings that generalize to open speakers.
- The Gram-matrix matching method also produces embeddings aligned with subjective similarity, though the vector-prediction method shows clearer synthesis gains.
- Large-scale subjective scoring provides usable supervision for learning perceptually grounded speaker representations.
Where Pith is reading between the lines
- The same training procedure could be applied to male speakers or to other languages to test whether the correlation with synthesis quality holds across different populations.
- The approach suggests that future embedding objectives for synthesis should explicitly incorporate perceptual similarity rather than relying solely on classification or reconstruction losses.
- If the subjective matrix captures dimensions relevant to prosody or timbre, the embeddings may also benefit voice-conversion or speaker-adaptation tasks that were not tested in the paper.
Load-bearing premise
The crowdsourced similarity scores from 153 Japanese female speakers serve as reliable ground truth that predicts which embeddings will produce better synthesis for speakers never seen in training.
What would settle it
A new set of listening tests on held-out speakers in which the proposed embeddings produce no measurable improvement in naturalness over conventional d-vectors, or in which the learned embeddings show low correlation with fresh subjective similarity ratings.
Figures
read the original abstract
This paper proposes novel algorithms for speaker embedding using subjective inter-speaker similarity based on deep neural networks (DNNs). Although conventional DNN-based speaker embedding such as a $d$-vector can be applied to multi-speaker modeling in speech synthesis, it does not correlate with the subjective inter-speaker similarity and is not necessarily appropriate speaker representation for open speakers whose speech utterances are not included in the training data. We propose two training algorithms for DNN-based speaker embedding model using an inter-speaker similarity matrix obtained by large-scale subjective scoring. One is based on similarity vector embedding and trains the model to predict a vector of the similarity matrix as speaker representation. The other is based on similarity matrix embedding and trains the model to minimize the squared Frobenius norm between the similarity matrix and the Gram matrix of $d$-vectors, i.e., the inter-speaker similarity derived from the $d$-vectors. We crowdsourced the inter-speaker similarity scores of 153 Japanese female speakers, and the experimental results demonstrate that our algorithms learn speaker embedding that is highly correlated with the subjective similarity. We also apply the proposed speaker embedding to multi-speaker modeling in DNN-based speech synthesis and reveal that the proposed similarity vector embedding improves synthetic speech quality for open speakers whose speech utterances are unseen during the training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two DNN training algorithms for speaker embedding derived from a crowdsourced inter-speaker similarity matrix collected from 153 Japanese female speakers. Similarity vector embedding trains the model to predict rows of the similarity matrix; similarity matrix embedding minimizes the squared Frobenius norm between the subjective matrix and the Gram matrix of the learned d-vectors. The paper claims the resulting embeddings are highly correlated with subjective similarity and that the vector-embedding variant improves perceptual quality in DNN-based multi-speaker TTS for speakers unseen during training.
Significance. If the central empirical claims hold after proper validation, the work would supply a perceptually motivated alternative to conventional d-vectors that improves generalization to open speakers in TTS. The large-scale subjective data collection is a concrete strength; however, the link between the collected similarity matrix and synthesis-relevant acoustic dimensions remains an untested modeling assumption.
major comments (3)
- [Abstract] Abstract: the claim that the embeddings are 'highly correlated with the subjective similarity' and 'improve synthetic speech quality for open speakers' is presented without any numerical correlation values, baseline comparisons (e.g., standard d-vector or i-vector), data-split details, or statistical tests, preventing assessment of effect size or risk of post-hoc selection.
- [Experiments section (correlation results)] The evaluation of correlation between learned embeddings and subjective scores is performed on the identical crowdsourced matrix used to supervise training; this circularity means the reported 'high correlation' does not yet demonstrate that the embedding captures structure that generalizes beyond the training speakers or to synthesis artifacts.
- [TTS experiments (open-speaker results)] The central claim that the similarity-vector embedding improves TTS quality for unseen speakers rests on the premise that the crowdsourced matrix encodes the perceptual dimensions that control synthesis artifacts (speaker consistency, prosody, etc.); no ablation or direct comparison isolating this alignment versus other factors (identity, recording conditions) is provided.
minor comments (2)
- [Method] The notation distinguishing the conventional d-vector from the proposed similarity-derived embeddings would benefit from an explicit equation or diagram in the method section.
- [Experimental setup] Details on the DNN architecture, optimizer, and hyper-parameters used for both embedding training and the downstream TTS system are missing or insufficiently referenced.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications, defenses on substance, and commitments to revisions that strengthen the presentation without altering the core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the embeddings are 'highly correlated with the subjective similarity' and 'improve synthetic speech quality for open speakers' is presented without any numerical correlation values, baseline comparisons (e.g., standard d-vector or i-vector), data-split details, or statistical tests, preventing assessment of effect size or risk of post-hoc selection.
Authors: We agree that the abstract would benefit from explicit numerical support. In the revised manuscript we will incorporate key quantitative results already reported in the experiments section: Pearson correlation coefficients for both proposed methods versus standard d-vector and i-vector baselines, speaker-disjoint data-split details, and reference to the statistical tests used. This will allow readers to assess effect sizes directly from the abstract. revision: yes
-
Referee: [Experiments section (correlation results)] The evaluation of correlation between learned embeddings and subjective scores is performed on the identical crowdsourced matrix used to supervise training; this circularity means the reported 'high correlation' does not yet demonstrate that the embedding captures structure that generalizes beyond the training speakers or to synthesis artifacts.
Authors: The reported correlations confirm that the models successfully recover the subjective similarity structure on which they were trained. To demonstrate generalization, the revised experiments will include a speaker-disjoint protocol: the embedding model will be retrained on a random subset of speakers and correlation will be evaluated on the held-out speakers' rows of the similarity matrix. This addition directly addresses the concern about structure beyond the training speakers. revision: partial
-
Referee: [TTS experiments (open-speaker results)] The central claim that the similarity-vector embedding improves TTS quality for unseen speakers rests on the premise that the crowdsourced matrix encodes the perceptual dimensions that control synthesis artifacts (speaker consistency, prosody, etc.); no ablation or direct comparison isolating this alignment versus other factors (identity, recording conditions) is provided.
Authors: The crowdsourced similarity scores are by definition perceptual; the observed improvement in MOS for open speakers relative to standard d-vectors already provides empirical support that the learned embeddings capture dimensions relevant to synthesis quality. We will expand the discussion to explicitly articulate this modeling assumption and its grounding in subjective data. A full factorial ablation isolating every acoustic factor is beyond the scope of the present study, but the existing baseline comparisons already control for speaker identity. revision: partial
Circularity Check
Empirical training directly targets subjective similarity matrix; correlation follows from objective but TTS gains are independent validation
full rationale
The paper defines two explicit training objectives (similarity vector prediction and Frobenius-norm matching to the Gram matrix) that take the crowdsourced inter-speaker similarity matrix as direct supervision. The claim of 'highly correlated' embeddings is therefore an expected outcome of optimization rather than an independent derivation. No self-citations, uniqueness theorems, or ansatzes are used in a load-bearing way, and the multi-speaker TTS experiments for unseen speakers constitute a separate empirical test. The overall chain remains self-contained against external benchmarks with only minor tautology in the correlation reporting.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Crowdsourced pairwise similarity scores form a reliable and stable representation of human inter-speaker perceptual distance.
- domain assumption Minimizing squared Frobenius distance between the similarity matrix and the Gram matrix of embeddings produces embeddings useful for downstream TTS.
Reference graph
Works this paper leans on
-
[1]
Introduction Statistical parametric speech synthesis (SPSS) [1] is a technique for synthesizing naturally sounding and easily controllab le syn- thetic speech. Recent developments of both training algori thms and acoustic modeling for SPSS using deep neural networks (DNNs) [2] have significantly improved quality of synthetic speech. For instance, training ...
-
[2]
Conventional speaker embedding 2.1. One-hot speaker code A speaker code [11] c = [ c(1), · · ·, c(n), · · ·, c(Ns)]⊤ is the 1-of-Ns representation for identifying the one of closed Ns speakers, which is the discrete representation of speaker iden- tity. The speaker code ci for the ith speaker is defined as fol- lows: ci(n) = { 1 if n = i 0 otherwise (1 ≤ n...
-
[3]
To what degree do the ith speaker’s voice and the jth speaker’s one sound similar?
Proposed speaker embedding Here, we propose two algorithms for learning speaker embed- ding that is highly correlated with the subjective inter-sp eaker similarity. 3.1. Subjective inter-speaker similarity matrix We define a subjective inter-speaker similarity matrix that rep- resents the speaker-pair similarity perceived by listener s. Let S = [ s1, · · ·...
-
[4]
Experimental Evaluation 4.1. Experimental conditions 4.1.1. Conditions for large-scale subjective scoring We conducted large-scale subjective scoring for obtaining the similarity matrix S by using our crowdsourced evaluation sys- tems. We used 153 Japanese female speakers included in the JNAS corpus [22]. Each speaker utters at least 150 reading- style ut...
work page 1904
-
[5]
Conclusion This paper proposed novel algorithms for incorporating sub jec- tive inter-speaker similarity perceived by listeners into the train- ing a speaker embedding model based on deep neural networks (DNNs). The algorithms used an inter-speaker similarity ma - trix obtained from large-scale subjective scoring as a cons traint on training the model. Tw...
-
[6]
Acknowledgements Part of this work was supported by the SECOM Science and Technology Foundation, JSPS KAKENHI Grant Number 18J22090 and 17H06101, the Ministry of Internal Affairs and Communications, and the GAP foundation program of the Uni- versity of Tokyo
-
[7]
Statistical parametric speech synthesis,
H. Zen, K. Tokuda, and A. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039– 1064, Nov. 2009
work page 2009
-
[8]
Statistical paramet ric speech synthesis using deep neural networks,
H. Zen, A. Senior, and M. Schuster, “Statistical paramet ric speech synthesis using deep neural networks,” in Proc. ICASSP, V ancou- ver, Canada, May 2013, pp. 7962–7966
work page 2013
-
[9]
Generative a d- versarial nets,
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Ward e- Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative a d- versarial nets,” in Proc. NIPS, Montreal, Canada, Dec. 2014, pp. 2672–2680
work page 2014
-
[10]
Statistical parametric speech synthesis incorporating generative adversarial ne tworks,
Y . Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial ne tworks,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 26, no. 1, pp. 84–96, Jan. 2018
work page 2018
-
[11]
Y . Saito, S. Takamichi, and H. Saruwatari, “Text-to-spe ech syn- thesis using STFT spectra based on low-/multi-resolution g ener- ative adversarial networks,” in Proc. ICASSP, Alberta, Canada, Apr. 2018, pp. 5299–5303
work page 2018
-
[12]
Y . Zhao, S. Takaki, H.-T. Luong, J. Y amagishi, D. Saito, a nd N. Minematsu, “Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthe- sis systems using a WaveNet vocoder,” IEEE Access, vol. 6, pp. 60478–60488, Sep. 2018
work page 2018
-
[13]
Tacotron: Towards End-to-End Speech Synthesis
Y . Wang, RJ Skerry-Ryan, D. Stanton, Y . Wu, R.-J. Weiss, N. Jaitly, Z. Y ang, Y . Xiao, Z. Chen, S. Bengio, Q. Le, Y . Agiomyrgiannakis, R. Clark, and R.-A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” arXiv, vol. abs/1703.10135, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
WaveNet: A Generative Model for Raw Audio
A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv, vol. abs/1609.03499, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[15]
Speaker-dependent WaveNet vocoder,
A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. IN- TERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 1118–1122
work page 2017
-
[16]
Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,
J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Y ang, Z. Chen, Y . Zhang, Y . Wang, RJ Skerry-Ryan, R. A. Saurous, Y . Agiomyrgiannakis, and Y . Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 4779–4783
work page 2018
-
[17]
DNN-based speech synt hesis using speaker codes,
N. Hojo, Y . Ijima, and H. Mizuno, “DNN-based speech synt hesis using speaker codes,” IEICE Transactions on Information and Systems, vol. E101-D, no. 2, pp. 462–472, Feb. 2018
work page 2018
-
[18]
Adapting and controlling DNN-based speech synthesis using input cod es,
H.-T. Luong, S. Takaki, G. E. Henter, and J. Y amagishi, “Adapting and controlling DNN-based speech synthesis using input cod es,” in Proc. ICASSP , New Orleans, U.S.A., May 2017, pp. 1905– 1909
work page 2017
-
[19]
Deep neural networks for small footprint text- dependent speaker verification,
E. V ariani, X. Lei, E. McDermott, I. L. Moreno, and J. Gon zalez- Dominguez, “Deep neural networks for small footprint text- dependent speaker verification,” in Proc. ICASSP, Florence, Italy, May 2014, pp. 4080–4084
work page 2014
-
[20]
Y . Saito, Y . Ijima, K. Nishida, and S. Takamichi, “Non-p arallel voice conversion using variational autoencoders conditio ned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP , Al- berta, Canada, Apr. 2018, pp. 5274–5278
work page 2018
-
[21]
Auto-Encoding Variational Bayes
D. P . Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv, vol. abs/1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[22]
D. I. Shuman, S. K. Narang, P . Frossard, A. Ortega, and P . V an- dergheynst, “The emerging field of signal processing on grap hs: Extending high-dimensional data analysis to networks and o ther irregular domains,” IEEE Signal Processing Magazine , vol. 30, no. 3, pp. 83–98, May 2013
work page 2013
-
[23]
Graph embedding techniques, a pplica- tions, and performance: A survey,
P . Goyal and E. Ferrara, “Graph embedding techniques, a pplica- tions, and performance: A survey,” Knowledge-Based Systems , vol. 151, pp. 78–94, Jul. 2018
work page 2018
-
[24]
A tech- nique for controlling voice quality of synthetic speech usi ng mul- tiple regression HSMM,
M. Tachibana, T. Nose, J. Y amagishi, and T. Kobayashi, “ A tech- nique for controlling voice quality of synthetic speech usi ng mul- tiple regression HSMM,” in Proc. ICSLP, Pittsburgh, U.S.A., Sep. 2006, pp. 2438–2441
work page 2006
-
[25]
Regression approaches to voice quality control based on on e-to- many eigenvoice conversion,
K. Ohta, Y . Ohtani, T. Toda, H. Saruwatari, and K. Shikan o, “Regression approaches to voice quality control based on on e-to- many eigenvoice conversion,” in Proc. ICSLP, Bonn, Germany, Aug. 2007, pp. 101–106
work page 2007
-
[26]
J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Y amagish i, Y . Morino, and Y . Ochiai, “Investigating different representations for modeling and controlling multiple emotions in DNN-base d speech synthesis,” Speech Communication, vol. 99, pp. 135–143, May 2018
work page 2018
-
[27]
Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,
J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 31–35
work page 2016
-
[28]
JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,
K. Itou, M. Y amamoto, K.Takeda, T. Takezawa, T. Matsuok a, T. Kobayashi, K. Shikano, and S. Itahashi, “JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,” Journal of the Acoustical Society of Japan (E) , vol. 20, no. 3, pp. 199–206, May 1999
work page 1999
-
[29]
H. Kawahara, I. Masuda-Katsuse, and A. D. Cheveigne, “R e- structuring speech representations using a pitch-adaptiv e time- frequency smoothing and an instantaneous-frequency-base d F0 extraction: Possible role of a repetitive structure in soun ds,” Speech Communication, vol. 27, no. 3–4, pp. 187–207, Apr. 1999
work page 1999
-
[30]
Adaptive subgradien t meth- ods for online learning and stochastic optimization,
J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradien t meth- ods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, Jul. 2011
work page 2011
-
[31]
Phonetic pos te- riorgrams for many-to-one voice conversion without parall el data training,
L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic pos te- riorgrams for many-to-one voice conversion without parall el data training,” in Proc. ICME, Seattle, U.S.A., Jul. 2016
work page 2016
-
[32]
Deep sparse rectifi er neural networks,
X. Glorot, A. Bordes, and Y . Bengio, “Deep sparse rectifi er neural networks,” in Proc. AISTATS, Lauderdale, U.S.A., Apr. 2011, pp. 315–323
work page 2011
-
[33]
Speech parameter generation algorithms for HMM-bas ed speech synthesis,
K. Tokuda, T. Y oshimura, T. Masuko, T. Kobayashi, and T. Kita- mura, “Speech parameter generation algorithms for HMM-bas ed speech synthesis,” in Proc. ICASSP, Istanbul, Turkey, June 2000, pp. 1315–1318
work page 2000
-
[34]
H. Kawahara, Jo Estill, and O. Fujimura, “Aperiodicity extrac- tion and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modificati on and synthesis system STRAIGHT,” in MA VEBA 2001, Florence, Italy, Sep. 2001, pp. 1–6
work page 2001
-
[35]
Maxi mum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,
Y . Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Maxi mum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,” in Proc. INTERSPEECH, Pittsburgh, U.S.A., Sep. 2006, pp. 2266–2269
work page 2006
-
[36]
Semi-Supervised Classification with Graph Convolutional Networks
T. N. Kipf and M. Welling, “Semi-supervised classificat ion with graph convolutional networks,” arXiv, vol. abs/1609.02907, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.