DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis

Hiroshi Saruwatari; Shinnosuke Takamichi; Yuki Saito

arxiv: 1907.08294 · v1 · pith:2MN3YJNHnew · submitted 2019-07-19 · 📡 eess.AS · cs.LG· cs.SD· stat.ML

DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis

Yuki Saito , Shinnosuke Takamichi , Hiroshi Saruwatari This is my paper

Pith reviewed 2026-05-24 19:03 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SDstat.ML

keywords speaker embeddingDNNspeech synthesisinter-speaker similaritymulti-speaker modelingsubjective evaluationd-vectoropen speaker

0 comments

The pith

DNN speaker embeddings trained to match crowdsourced human similarity scores correlate with perception and raise synthesis quality for unseen speakers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard d-vector embeddings fail to capture how humans judge speaker similarity, so the authors train new DNN embeddings on a large matrix of subjective ratings collected from 153 Japanese female speakers. Two algorithms are introduced: one directly predicts each speaker's row in the similarity matrix, and the other forces the Gram matrix of the embeddings to match the subjective matrix. Experiments show the resulting vectors align closely with human judgments, and the direct vector-prediction method produces higher-quality synthetic speech when the target speaker was never heard during training.

Core claim

The central claim is that DNN-based speaker embedding models trained either to predict a vector of the subjective inter-speaker similarity matrix or to minimize the squared Frobenius norm between that matrix and the Gram matrix of the d-vectors learn embeddings that are highly correlated with subjective similarity, and that the similarity-vector approach improves the naturalness of DNN-based multi-speaker speech synthesis for open speakers whose utterances were absent from the training set.

What carries the argument

The crowdsourced inter-speaker similarity matrix, used either as direct regression targets for similarity-vector embedding or as the reference for Gram-matrix matching in similarity-matrix embedding.

If this is right

The similarity-vector embedding can be substituted for d-vectors inside existing multi-speaker DNN synthesizers.
Training targets derived from human similarity judgments yield embeddings that generalize to open speakers.
The Gram-matrix matching method also produces embeddings aligned with subjective similarity, though the vector-prediction method shows clearer synthesis gains.
Large-scale subjective scoring provides usable supervision for learning perceptually grounded speaker representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training procedure could be applied to male speakers or to other languages to test whether the correlation with synthesis quality holds across different populations.
The approach suggests that future embedding objectives for synthesis should explicitly incorporate perceptual similarity rather than relying solely on classification or reconstruction losses.
If the subjective matrix captures dimensions relevant to prosody or timbre, the embeddings may also benefit voice-conversion or speaker-adaptation tasks that were not tested in the paper.

Load-bearing premise

The crowdsourced similarity scores from 153 Japanese female speakers serve as reliable ground truth that predicts which embeddings will produce better synthesis for speakers never seen in training.

What would settle it

A new set of listening tests on held-out speakers in which the proposed embeddings produce no measurable improvement in naturalness over conventional d-vectors, or in which the learned embeddings show low correlation with fresh subjective similarity ratings.

Figures

Figures reproduced from arXiv: 1907.08294 by Hiroshi Saruwatari, Shinnosuke Takamichi, Yuki Saito.

**Figure 1.** Figure 1: (a) Similarity matrix of 153 Japanese female speakers and (b) its sub-matrix obtained by large-scale subjective scoring [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Calculation of loss functions in proposed algorithms based on (a) similarity vector embedding and (b) similarity matrix embedding. 3.2. Training based on similarity vector embedding The first proposed algorithm uses the similarity vector as a target to be predicted by the DNNs, instead of the conventional speaker code. The loss function for the training is defined as follows: L (vec) SIM (s, sˆ) = 1 Ns (… view at source ↗

**Figure 4.** Figure 4: Histogram of similarity scores with its cumulative ratio denoted by red line. that the latent variables of the HMMs or GMMs were related to the word pairs. Our algorithms extend these ideas to make the DNNs model the pair-wise speaker similarity as the embedding vectors rather than the conventional point-wise impressions of one speaker. Furthermore, the relationship between the speakers’ intention and th… view at source ↗

**Figure 5.** Figure 5: Histogram of speaker-pair-wise similarity scores. The speaker pair includes the 13 speakers shown in [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: Scatter plots of similarity scores si,j and values of kernel k(di, dj ) with their correlation coefficient r. These plots were made by all speaker pairs. 0 1 (a) Closed-Closed Similarity scores si,j r = 0.0616 (1) d-vec. r = 0.1904 (2) Prop. (vec) r = 0.3243 (3) Prop. (mat) r = 0.8919 (4) Prop. (mat-re) −1 0 1 0 1 (b) Closed-Open r = 0.0741 −1 0 1 r = 0.2315 Value of kernel k(di, dj ) −1 0 1 r = 0.2517 −1 … view at source ↗

**Figure 7.** Figure 7: Scatter plots of similarity scores si,j and values of kernel k(di, dj ) with their correlation coefficient r. These plots were made by speaker pairs whose similarity scores were greater than 0. for the encoder. The VAEs were trained to maximize the variational lower bound of the log likelihood [15] with 25 epochs using the same training data as in the embedding model training. The maximum likelihood para… view at source ↗

read the original abstract

This paper proposes novel algorithms for speaker embedding using subjective inter-speaker similarity based on deep neural networks (DNNs). Although conventional DNN-based speaker embedding such as a $d$-vector can be applied to multi-speaker modeling in speech synthesis, it does not correlate with the subjective inter-speaker similarity and is not necessarily appropriate speaker representation for open speakers whose speech utterances are not included in the training data. We propose two training algorithms for DNN-based speaker embedding model using an inter-speaker similarity matrix obtained by large-scale subjective scoring. One is based on similarity vector embedding and trains the model to predict a vector of the similarity matrix as speaker representation. The other is based on similarity matrix embedding and trains the model to minimize the squared Frobenius norm between the similarity matrix and the Gram matrix of $d$-vectors, i.e., the inter-speaker similarity derived from the $d$-vectors. We crowdsourced the inter-speaker similarity scores of 153 Japanese female speakers, and the experimental results demonstrate that our algorithms learn speaker embedding that is highly correlated with the subjective similarity. We also apply the proposed speaker embedding to multi-speaker modeling in DNN-based speech synthesis and reveal that the proposed similarity vector embedding improves synthetic speech quality for open speakers whose speech utterances are unseen during the training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper trains speaker embeddings to match a crowdsourced similarity matrix from 153 Japanese female speakers and reports better DNN-TTS quality for unseen voices, but the gains depend on how well listener scores track synthesis artifacts.

read the letter

The main point is that they collect listener ratings of how similar pairs of speakers sound and then train embeddings either to predict those ratings directly or to make the Gram matrix of the embeddings match the subjective matrix. This is applied to multi-speaker TTS and they claim the resulting embeddings give higher quality for speakers whose data was held out of training. The two objectives are concrete and not just re-statements of standard d-vector training. Collecting the matrix at that scale is useful work and the abstract indicates the embeddings do correlate with the human scores while also lifting synthesis quality on open speakers. That combination is the practical contribution. The evaluation of correlation is done on the training matrix itself, which makes the high correlation less surprising. The harder question is whether the listener similarity scores actually capture the acoustic dimensions that matter for TTS artifacts such as speaker consistency or prosody; the paper would need to show that the quality improvement survives strong baselines and is not limited to the narrow Japanese-female domain. The stress-test concern about the matrix acting as a reliable proxy is reasonable and would need direct checks in the experiments. This is aimed at people working on voice cloning and multi-speaker synthesis who already use embedding-based models. It is a focused empirical step rather than a broad theoretical advance, but the method is clear enough and the data collection is reproducible, so it deserves a serious referee to examine the numbers, splits, and controls.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes two DNN training algorithms for speaker embedding derived from a crowdsourced inter-speaker similarity matrix collected from 153 Japanese female speakers. Similarity vector embedding trains the model to predict rows of the similarity matrix; similarity matrix embedding minimizes the squared Frobenius norm between the subjective matrix and the Gram matrix of the learned d-vectors. The paper claims the resulting embeddings are highly correlated with subjective similarity and that the vector-embedding variant improves perceptual quality in DNN-based multi-speaker TTS for speakers unseen during training.

Significance. If the central empirical claims hold after proper validation, the work would supply a perceptually motivated alternative to conventional d-vectors that improves generalization to open speakers in TTS. The large-scale subjective data collection is a concrete strength; however, the link between the collected similarity matrix and synthesis-relevant acoustic dimensions remains an untested modeling assumption.

major comments (3)

[Abstract] Abstract: the claim that the embeddings are 'highly correlated with the subjective similarity' and 'improve synthetic speech quality for open speakers' is presented without any numerical correlation values, baseline comparisons (e.g., standard d-vector or i-vector), data-split details, or statistical tests, preventing assessment of effect size or risk of post-hoc selection.
[Experiments section (correlation results)] The evaluation of correlation between learned embeddings and subjective scores is performed on the identical crowdsourced matrix used to supervise training; this circularity means the reported 'high correlation' does not yet demonstrate that the embedding captures structure that generalizes beyond the training speakers or to synthesis artifacts.
[TTS experiments (open-speaker results)] The central claim that the similarity-vector embedding improves TTS quality for unseen speakers rests on the premise that the crowdsourced matrix encodes the perceptual dimensions that control synthesis artifacts (speaker consistency, prosody, etc.); no ablation or direct comparison isolating this alignment versus other factors (identity, recording conditions) is provided.

minor comments (2)

[Method] The notation distinguishing the conventional d-vector from the proposed similarity-derived embeddings would benefit from an explicit equation or diagram in the method section.
[Experimental setup] Details on the DNN architecture, optimizer, and hyper-parameters used for both embedding training and the downstream TTS system are missing or insufficiently referenced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications, defenses on substance, and commitments to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the embeddings are 'highly correlated with the subjective similarity' and 'improve synthetic speech quality for open speakers' is presented without any numerical correlation values, baseline comparisons (e.g., standard d-vector or i-vector), data-split details, or statistical tests, preventing assessment of effect size or risk of post-hoc selection.

Authors: We agree that the abstract would benefit from explicit numerical support. In the revised manuscript we will incorporate key quantitative results already reported in the experiments section: Pearson correlation coefficients for both proposed methods versus standard d-vector and i-vector baselines, speaker-disjoint data-split details, and reference to the statistical tests used. This will allow readers to assess effect sizes directly from the abstract. revision: yes
Referee: [Experiments section (correlation results)] The evaluation of correlation between learned embeddings and subjective scores is performed on the identical crowdsourced matrix used to supervise training; this circularity means the reported 'high correlation' does not yet demonstrate that the embedding captures structure that generalizes beyond the training speakers or to synthesis artifacts.

Authors: The reported correlations confirm that the models successfully recover the subjective similarity structure on which they were trained. To demonstrate generalization, the revised experiments will include a speaker-disjoint protocol: the embedding model will be retrained on a random subset of speakers and correlation will be evaluated on the held-out speakers' rows of the similarity matrix. This addition directly addresses the concern about structure beyond the training speakers. revision: partial
Referee: [TTS experiments (open-speaker results)] The central claim that the similarity-vector embedding improves TTS quality for unseen speakers rests on the premise that the crowdsourced matrix encodes the perceptual dimensions that control synthesis artifacts (speaker consistency, prosody, etc.); no ablation or direct comparison isolating this alignment versus other factors (identity, recording conditions) is provided.

Authors: The crowdsourced similarity scores are by definition perceptual; the observed improvement in MOS for open speakers relative to standard d-vectors already provides empirical support that the learned embeddings capture dimensions relevant to synthesis quality. We will expand the discussion to explicitly articulate this modeling assumption and its grounding in subjective data. A full factorial ablation isolating every acoustic factor is beyond the scope of the present study, but the existing baseline comparisons already control for speaker identity. revision: partial

Circularity Check

0 steps flagged

Empirical training directly targets subjective similarity matrix; correlation follows from objective but TTS gains are independent validation

full rationale

The paper defines two explicit training objectives (similarity vector prediction and Frobenius-norm matching to the Gram matrix) that take the crowdsourced inter-speaker similarity matrix as direct supervision. The claim of 'highly correlated' embeddings is therefore an expected outcome of optimization rather than an independent derivation. No self-citations, uniqueness theorems, or ansatzes are used in a load-bearing way, and the multi-speaker TTS experiments for unseen speakers constitute a separate empirical test. The overall chain remains self-contained against external benchmarks with only minor tautology in the correlation reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard supervised DNN training assumptions and the validity of crowdsourced similarity scores as a perceptual ground truth. No free parameters, invented entities, or non-standard axioms are stated in the abstract.

axioms (2)

domain assumption Crowdsourced pairwise similarity scores form a reliable and stable representation of human inter-speaker perceptual distance.
The entire training procedure and the claim of improved correlation rest on treating the collected matrix as ground truth.
domain assumption Minimizing squared Frobenius distance between the similarity matrix and the Gram matrix of embeddings produces embeddings useful for downstream TTS.
This is the explicit training objective of the second algorithm.

pith-pipeline@v0.9.0 · 5779 in / 1432 out tokens · 19034 ms · 2026-05-24T19:03:40.015972+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 4 internal anchors

[1]

Recent developments of both training algori thms and acoustic modeling for SPSS using deep neural networks (DNNs) [2] have signiﬁcantly improved quality of synthetic speech

Introduction Statistical parametric speech synthesis (SPSS) [1] is a technique for synthesizing naturally sounding and easily controllab le syn- thetic speech. Recent developments of both training algori thms and acoustic modeling for SPSS using deep neural networks (DNNs) [2] have signiﬁcantly improved quality of synthetic speech. For instance, training ...

work page
[2]

Conventional speaker embedding 2.1. One-hot speaker code A speaker code [11] c = [ c(1), · · ·, c(n), · · ·, c(Ns)]⊤ is the 1-of-Ns representation for identifying the one of closed Ns speakers, which is the discrete representation of speaker iden- tity. The speaker code ci for the ith speaker is deﬁned as fol- lows: ci(n) = { 1 if n = i 0 otherwise (1 ≤ n...

work page
[3]

To what degree do the ith speaker’s voice and the jth speaker’s one sound similar?

Proposed speaker embedding Here, we propose two algorithms for learning speaker embed- ding that is highly correlated with the subjective inter-sp eaker similarity. 3.1. Subjective inter-speaker similarity matrix We deﬁne a subjective inter-speaker similarity matrix that rep- resents the speaker-pair similarity perceived by listener s. Let S = [ s1, · · ·...

work page
[4]

F001-F009

Experimental Evaluation 4.1. Experimental conditions 4.1.1. Conditions for large-scale subjective scoring We conducted large-scale subjective scoring for obtaining the similarity matrix S by using our crowdsourced evaluation sys- tems. We used 153 Japanese female speakers included in the JNAS corpus [22]. Each speaker utters at least 150 reading- style ut...

work page 1904
[5]

The algorithms used an inter-speaker similarity ma - trix obtained from large-scale subjective scoring as a cons traint on training the model

Conclusion This paper proposed novel algorithms for incorporating sub jec- tive inter-speaker similarity perceived by listeners into the train- ing a speaker embedding model based on deep neural networks (DNNs). The algorithms used an inter-speaker similarity ma - trix obtained from large-scale subjective scoring as a cons traint on training the model. Tw...

work page
[6]

Acknowledgements Part of this work was supported by the SECOM Science and Technology Foundation, JSPS KAKENHI Grant Number 18J22090 and 17H06101, the Ministry of Internal Affairs and Communications, and the GAP foundation program of the Uni- versity of Tokyo

work page
[7]

Statistical parametric speech synthesis,

H. Zen, K. Tokuda, and A. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039– 1064, Nov. 2009

work page 2009
[8]

Statistical paramet ric speech synthesis using deep neural networks,

H. Zen, A. Senior, and M. Schuster, “Statistical paramet ric speech synthesis using deep neural networks,” in Proc. ICASSP, V ancou- ver, Canada, May 2013, pp. 7962–7966

work page 2013
[9]

Generative a d- versarial nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Ward e- Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative a d- versarial nets,” in Proc. NIPS, Montreal, Canada, Dec. 2014, pp. 2672–2680

work page 2014
[10]

Statistical parametric speech synthesis incorporating generative adversarial ne tworks,

Y . Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial ne tworks,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 26, no. 1, pp. 84–96, Jan. 2018

work page 2018
[11]

Text-to-spe ech syn- thesis using STFT spectra based on low-/multi-resolution g ener- ative adversarial networks,

Y . Saito, S. Takamichi, and H. Saruwatari, “Text-to-spe ech syn- thesis using STFT spectra based on low-/multi-resolution g ener- ative adversarial networks,” in Proc. ICASSP, Alberta, Canada, Apr. 2018, pp. 5299–5303

work page 2018
[12]

Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthe- sis systems using a WaveNet vocoder,

Y . Zhao, S. Takaki, H.-T. Luong, J. Y amagishi, D. Saito, a nd N. Minematsu, “Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthe- sis systems using a WaveNet vocoder,” IEEE Access, vol. 6, pp. 60478–60488, Sep. 2018

work page 2018
[13]

Tacotron: Towards End-to-End Speech Synthesis

Y . Wang, RJ Skerry-Ryan, D. Stanton, Y . Wu, R.-J. Weiss, N. Jaitly, Z. Y ang, Y . Xiao, Z. Chen, S. Bengio, Q. Le, Y . Agiomyrgiannakis, R. Clark, and R.-A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” arXiv, vol. abs/1703.10135, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

WaveNet: A Generative Model for Raw Audio

A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv, vol. abs/1609.03499, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[15]

Speaker-dependent WaveNet vocoder,

A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. IN- TERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 1118–1122

work page 2017
[16]

Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Y ang, Z. Chen, Y . Zhang, Y . Wang, RJ Skerry-Ryan, R. A. Saurous, Y . Agiomyrgiannakis, and Y . Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 4779–4783

work page 2018
[17]

DNN-based speech synt hesis using speaker codes,

N. Hojo, Y . Ijima, and H. Mizuno, “DNN-based speech synt hesis using speaker codes,” IEICE Transactions on Information and Systems, vol. E101-D, no. 2, pp. 462–472, Feb. 2018

work page 2018
[18]

Adapting and controlling DNN-based speech synthesis using input cod es,

H.-T. Luong, S. Takaki, G. E. Henter, and J. Y amagishi, “Adapting and controlling DNN-based speech synthesis using input cod es,” in Proc. ICASSP , New Orleans, U.S.A., May 2017, pp. 1905– 1909

work page 2017
[19]

Deep neural networks for small footprint text- dependent speaker veriﬁcation,

E. V ariani, X. Lei, E. McDermott, I. L. Moreno, and J. Gon zalez- Dominguez, “Deep neural networks for small footprint text- dependent speaker veriﬁcation,” in Proc. ICASSP, Florence, Italy, May 2014, pp. 4080–4084

work page 2014
[20]

Non-p arallel voice conversion using variational autoencoders conditio ned by phonetic posteriorgrams and d-vectors,

Y . Saito, Y . Ijima, K. Nishida, and S. Takamichi, “Non-p arallel voice conversion using variational autoencoders conditio ned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP , Al- berta, Canada, Apr. 2018, pp. 5274–5278

work page 2018
[21]

Auto-Encoding Variational Bayes

D. P . Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv, vol. abs/1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[22]

The emerging ﬁeld of signal processing on grap hs: Extending high-dimensional data analysis to networks and o ther irregular domains,

D. I. Shuman, S. K. Narang, P . Frossard, A. Ortega, and P . V an- dergheynst, “The emerging ﬁeld of signal processing on grap hs: Extending high-dimensional data analysis to networks and o ther irregular domains,” IEEE Signal Processing Magazine , vol. 30, no. 3, pp. 83–98, May 2013

work page 2013
[23]

Graph embedding techniques, a pplica- tions, and performance: A survey,

P . Goyal and E. Ferrara, “Graph embedding techniques, a pplica- tions, and performance: A survey,” Knowledge-Based Systems , vol. 151, pp. 78–94, Jul. 2018

work page 2018
[24]

A tech- nique for controlling voice quality of synthetic speech usi ng mul- tiple regression HSMM,

M. Tachibana, T. Nose, J. Y amagishi, and T. Kobayashi, “ A tech- nique for controlling voice quality of synthetic speech usi ng mul- tiple regression HSMM,” in Proc. ICSLP, Pittsburgh, U.S.A., Sep. 2006, pp. 2438–2441

work page 2006
[25]

Regression approaches to voice quality control based on on e-to- many eigenvoice conversion,

K. Ohta, Y . Ohtani, T. Toda, H. Saruwatari, and K. Shikan o, “Regression approaches to voice quality control based on on e-to- many eigenvoice conversion,” in Proc. ICSLP, Bonn, Germany, Aug. 2007, pp. 101–106

work page 2007
[26]

Investigating different representations for modeling and controlling multiple emotions in DNN-base d speech synthesis,

J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Y amagish i, Y . Morino, and Y . Ochiai, “Investigating different representations for modeling and controlling multiple emotions in DNN-base d speech synthesis,” Speech Communication, vol. 99, pp. 135–143, May 2018

work page 2018
[27]

Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,

J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 31–35

work page 2016
[28]

JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,

K. Itou, M. Y amamoto, K.Takeda, T. Takezawa, T. Matsuok a, T. Kobayashi, K. Shikano, and S. Itahashi, “JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,” Journal of the Acoustical Society of Japan (E) , vol. 20, no. 3, pp. 199–206, May 1999

work page 1999
[29]

Kawahara, I

H. Kawahara, I. Masuda-Katsuse, and A. D. Cheveigne, “R e- structuring speech representations using a pitch-adaptiv e time- frequency smoothing and an instantaneous-frequency-base d F0 extraction: Possible role of a repetitive structure in soun ds,” Speech Communication, vol. 27, no. 3–4, pp. 187–207, Apr. 1999

work page 1999
[30]

Adaptive subgradien t meth- ods for online learning and stochastic optimization,

J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradien t meth- ods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, Jul. 2011

work page 2011
[31]

Phonetic pos te- riorgrams for many-to-one voice conversion without parall el data training,

L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic pos te- riorgrams for many-to-one voice conversion without parall el data training,” in Proc. ICME, Seattle, U.S.A., Jul. 2016

work page 2016
[32]

Deep sparse rectiﬁ er neural networks,

X. Glorot, A. Bordes, and Y . Bengio, “Deep sparse rectiﬁ er neural networks,” in Proc. AISTATS, Lauderdale, U.S.A., Apr. 2011, pp. 315–323

work page 2011
[33]

Speech parameter generation algorithms for HMM-bas ed speech synthesis,

K. Tokuda, T. Y oshimura, T. Masuko, T. Kobayashi, and T. Kita- mura, “Speech parameter generation algorithms for HMM-bas ed speech synthesis,” in Proc. ICASSP, Istanbul, Turkey, June 2000, pp. 1315–1318

work page 2000
[34]

Aperiodicity extrac- tion and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modiﬁcati on and synthesis system STRAIGHT,

H. Kawahara, Jo Estill, and O. Fujimura, “Aperiodicity extrac- tion and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modiﬁcati on and synthesis system STRAIGHT,” in MA VEBA 2001, Florence, Italy, Sep. 2001, pp. 1–6

work page 2001
[35]

Maxi mum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,

Y . Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Maxi mum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,” in Proc. INTERSPEECH, Pittsburgh, U.S.A., Sep. 2006, pp. 2266–2269

work page 2006
[36]

Semi-Supervised Classification with Graph Convolutional Networks

T. N. Kipf and M. Welling, “Semi-supervised classiﬁcat ion with graph convolutional networks,” arXiv, vol. abs/1609.02907, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[1] [1]

Recent developments of both training algori thms and acoustic modeling for SPSS using deep neural networks (DNNs) [2] have signiﬁcantly improved quality of synthetic speech

Introduction Statistical parametric speech synthesis (SPSS) [1] is a technique for synthesizing naturally sounding and easily controllab le syn- thetic speech. Recent developments of both training algori thms and acoustic modeling for SPSS using deep neural networks (DNNs) [2] have signiﬁcantly improved quality of synthetic speech. For instance, training ...

work page

[2] [2]

Conventional speaker embedding 2.1. One-hot speaker code A speaker code [11] c = [ c(1), · · ·, c(n), · · ·, c(Ns)]⊤ is the 1-of-Ns representation for identifying the one of closed Ns speakers, which is the discrete representation of speaker iden- tity. The speaker code ci for the ith speaker is deﬁned as fol- lows: ci(n) = { 1 if n = i 0 otherwise (1 ≤ n...

work page

[3] [3]

To what degree do the ith speaker’s voice and the jth speaker’s one sound similar?

Proposed speaker embedding Here, we propose two algorithms for learning speaker embed- ding that is highly correlated with the subjective inter-sp eaker similarity. 3.1. Subjective inter-speaker similarity matrix We deﬁne a subjective inter-speaker similarity matrix that rep- resents the speaker-pair similarity perceived by listener s. Let S = [ s1, · · ·...

work page

[4] [4]

F001-F009

Experimental Evaluation 4.1. Experimental conditions 4.1.1. Conditions for large-scale subjective scoring We conducted large-scale subjective scoring for obtaining the similarity matrix S by using our crowdsourced evaluation sys- tems. We used 153 Japanese female speakers included in the JNAS corpus [22]. Each speaker utters at least 150 reading- style ut...

work page 1904

[5] [5]

The algorithms used an inter-speaker similarity ma - trix obtained from large-scale subjective scoring as a cons traint on training the model

Conclusion This paper proposed novel algorithms for incorporating sub jec- tive inter-speaker similarity perceived by listeners into the train- ing a speaker embedding model based on deep neural networks (DNNs). The algorithms used an inter-speaker similarity ma - trix obtained from large-scale subjective scoring as a cons traint on training the model. Tw...

work page

[6] [6]

Acknowledgements Part of this work was supported by the SECOM Science and Technology Foundation, JSPS KAKENHI Grant Number 18J22090 and 17H06101, the Ministry of Internal Affairs and Communications, and the GAP foundation program of the Uni- versity of Tokyo

work page

[7] [7]

Statistical parametric speech synthesis,

H. Zen, K. Tokuda, and A. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039– 1064, Nov. 2009

work page 2009

[8] [8]

Statistical paramet ric speech synthesis using deep neural networks,

H. Zen, A. Senior, and M. Schuster, “Statistical paramet ric speech synthesis using deep neural networks,” in Proc. ICASSP, V ancou- ver, Canada, May 2013, pp. 7962–7966

work page 2013

[9] [9]

Generative a d- versarial nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Ward e- Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative a d- versarial nets,” in Proc. NIPS, Montreal, Canada, Dec. 2014, pp. 2672–2680

work page 2014

[10] [10]

Statistical parametric speech synthesis incorporating generative adversarial ne tworks,

Y . Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial ne tworks,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 26, no. 1, pp. 84–96, Jan. 2018

work page 2018

[11] [11]

Text-to-spe ech syn- thesis using STFT spectra based on low-/multi-resolution g ener- ative adversarial networks,

Y . Saito, S. Takamichi, and H. Saruwatari, “Text-to-spe ech syn- thesis using STFT spectra based on low-/multi-resolution g ener- ative adversarial networks,” in Proc. ICASSP, Alberta, Canada, Apr. 2018, pp. 5299–5303

work page 2018

[12] [12]

Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthe- sis systems using a WaveNet vocoder,

Y . Zhao, S. Takaki, H.-T. Luong, J. Y amagishi, D. Saito, a nd N. Minematsu, “Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthe- sis systems using a WaveNet vocoder,” IEEE Access, vol. 6, pp. 60478–60488, Sep. 2018

work page 2018

[13] [13]

Tacotron: Towards End-to-End Speech Synthesis

Y . Wang, RJ Skerry-Ryan, D. Stanton, Y . Wu, R.-J. Weiss, N. Jaitly, Z. Y ang, Y . Xiao, Z. Chen, S. Bengio, Q. Le, Y . Agiomyrgiannakis, R. Clark, and R.-A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” arXiv, vol. abs/1703.10135, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[14] [14]

WaveNet: A Generative Model for Raw Audio

A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv, vol. abs/1609.03499, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[15] [15]

Speaker-dependent WaveNet vocoder,

A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. IN- TERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 1118–1122

work page 2017

[16] [16]

Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Y ang, Z. Chen, Y . Zhang, Y . Wang, RJ Skerry-Ryan, R. A. Saurous, Y . Agiomyrgiannakis, and Y . Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 4779–4783

work page 2018

[17] [17]

DNN-based speech synt hesis using speaker codes,

N. Hojo, Y . Ijima, and H. Mizuno, “DNN-based speech synt hesis using speaker codes,” IEICE Transactions on Information and Systems, vol. E101-D, no. 2, pp. 462–472, Feb. 2018

work page 2018

[18] [18]

Adapting and controlling DNN-based speech synthesis using input cod es,

H.-T. Luong, S. Takaki, G. E. Henter, and J. Y amagishi, “Adapting and controlling DNN-based speech synthesis using input cod es,” in Proc. ICASSP , New Orleans, U.S.A., May 2017, pp. 1905– 1909

work page 2017

[19] [19]

Deep neural networks for small footprint text- dependent speaker veriﬁcation,

E. V ariani, X. Lei, E. McDermott, I. L. Moreno, and J. Gon zalez- Dominguez, “Deep neural networks for small footprint text- dependent speaker veriﬁcation,” in Proc. ICASSP, Florence, Italy, May 2014, pp. 4080–4084

work page 2014

[20] [20]

Non-p arallel voice conversion using variational autoencoders conditio ned by phonetic posteriorgrams and d-vectors,

Y . Saito, Y . Ijima, K. Nishida, and S. Takamichi, “Non-p arallel voice conversion using variational autoencoders conditio ned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP , Al- berta, Canada, Apr. 2018, pp. 5274–5278

work page 2018

[21] [21]

Auto-Encoding Variational Bayes

D. P . Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv, vol. abs/1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[22] [22]

The emerging ﬁeld of signal processing on grap hs: Extending high-dimensional data analysis to networks and o ther irregular domains,

D. I. Shuman, S. K. Narang, P . Frossard, A. Ortega, and P . V an- dergheynst, “The emerging ﬁeld of signal processing on grap hs: Extending high-dimensional data analysis to networks and o ther irregular domains,” IEEE Signal Processing Magazine , vol. 30, no. 3, pp. 83–98, May 2013

work page 2013

[23] [23]

Graph embedding techniques, a pplica- tions, and performance: A survey,

P . Goyal and E. Ferrara, “Graph embedding techniques, a pplica- tions, and performance: A survey,” Knowledge-Based Systems , vol. 151, pp. 78–94, Jul. 2018

work page 2018

[24] [24]

A tech- nique for controlling voice quality of synthetic speech usi ng mul- tiple regression HSMM,

M. Tachibana, T. Nose, J. Y amagishi, and T. Kobayashi, “ A tech- nique for controlling voice quality of synthetic speech usi ng mul- tiple regression HSMM,” in Proc. ICSLP, Pittsburgh, U.S.A., Sep. 2006, pp. 2438–2441

work page 2006

[25] [25]

Regression approaches to voice quality control based on on e-to- many eigenvoice conversion,

K. Ohta, Y . Ohtani, T. Toda, H. Saruwatari, and K. Shikan o, “Regression approaches to voice quality control based on on e-to- many eigenvoice conversion,” in Proc. ICSLP, Bonn, Germany, Aug. 2007, pp. 101–106

work page 2007

[26] [26]

Investigating different representations for modeling and controlling multiple emotions in DNN-base d speech synthesis,

J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Y amagish i, Y . Morino, and Y . Ochiai, “Investigating different representations for modeling and controlling multiple emotions in DNN-base d speech synthesis,” Speech Communication, vol. 99, pp. 135–143, May 2018

work page 2018

[27] [27]

Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,

J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 31–35

work page 2016

[28] [28]

JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,

K. Itou, M. Y amamoto, K.Takeda, T. Takezawa, T. Matsuok a, T. Kobayashi, K. Shikano, and S. Itahashi, “JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,” Journal of the Acoustical Society of Japan (E) , vol. 20, no. 3, pp. 199–206, May 1999

work page 1999

[29] [29]

Kawahara, I

H. Kawahara, I. Masuda-Katsuse, and A. D. Cheveigne, “R e- structuring speech representations using a pitch-adaptiv e time- frequency smoothing and an instantaneous-frequency-base d F0 extraction: Possible role of a repetitive structure in soun ds,” Speech Communication, vol. 27, no. 3–4, pp. 187–207, Apr. 1999

work page 1999

[30] [30]

Adaptive subgradien t meth- ods for online learning and stochastic optimization,

J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradien t meth- ods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, Jul. 2011

work page 2011

[31] [31]

Phonetic pos te- riorgrams for many-to-one voice conversion without parall el data training,

L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic pos te- riorgrams for many-to-one voice conversion without parall el data training,” in Proc. ICME, Seattle, U.S.A., Jul. 2016

work page 2016

[32] [32]

Deep sparse rectiﬁ er neural networks,

X. Glorot, A. Bordes, and Y . Bengio, “Deep sparse rectiﬁ er neural networks,” in Proc. AISTATS, Lauderdale, U.S.A., Apr. 2011, pp. 315–323

work page 2011

[33] [33]

Speech parameter generation algorithms for HMM-bas ed speech synthesis,

K. Tokuda, T. Y oshimura, T. Masuko, T. Kobayashi, and T. Kita- mura, “Speech parameter generation algorithms for HMM-bas ed speech synthesis,” in Proc. ICASSP, Istanbul, Turkey, June 2000, pp. 1315–1318

work page 2000

[34] [34]

Aperiodicity extrac- tion and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modiﬁcati on and synthesis system STRAIGHT,

H. Kawahara, Jo Estill, and O. Fujimura, “Aperiodicity extrac- tion and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modiﬁcati on and synthesis system STRAIGHT,” in MA VEBA 2001, Florence, Italy, Sep. 2001, pp. 1–6

work page 2001

[35] [35]

Maxi mum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,

Y . Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Maxi mum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,” in Proc. INTERSPEECH, Pittsburgh, U.S.A., Sep. 2006, pp. 2266–2269

work page 2006

[36] [36]

Semi-Supervised Classification with Graph Convolutional Networks

T. N. Kipf and M. Welling, “Semi-supervised classiﬁcat ion with graph convolutional networks,” arXiv, vol. abs/1609.02907, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016