arXiv: 1907.06111 · v1 · pith:SB6GPRJZnew · submitted 2019-07-13 · 📡 eess.AS · cs.CL · cs.SD

Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors

Pith reviewed 2026-05-24 21:49 UTC · model grok-4.3

classification: eess.AS, cs.CL, cs.SD
keywords: speaker recognition, i-vectors, HMM, text-dependent verification, random digits, uncertainty normalization, RSR2015, RedDots

The pith

Digit-specific HMMs enable per-digit i-vectors that reach 1.52% EER on random-digit speaker verification using only one training corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hidden Markov models tied to individual digits can segment random-digit strings, align frames to states, and feed localized statistics into separate i-vector extractors for each digit. This produces i-vectors that model only the phonetic content of a single digit rather than mixing across an utterance. A new uncertainty normalization step is introduced to handle variability in those estimates, and the resulting system is scored with plain cosine distance after simple normalization. The approach yields lower error rates than x-vector systems trained on far larger datasets while requiring no multi-handset recordings per speaker.

Core claim

Digit-specific HMMs segment utterances into digits and supply frame alignments for extracting Baum-Welch statistics; digit-specific i-vector extractors are then trained on those statistics so each i-vector models only one digit's phonetic content; uncertainty in the i-vector estimates is normalized before scoring; on RSR2015 part III this single system trained only on that corpus attains 1.52% EER for males and 1.77% EER for females using score-normalized cosine distance, outperforming x-vectors and showing only minor loss when channel compensation is omitted.
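
The statistics-gathering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a hard (Viterbi-style) alignment in which every frame already carries one digit label and one HMM state index, whereas the paper may use soft state posteriors. The function names and the `digit_ids` / `state_ids` arrays are hypothetical.

```python
import numpy as np

def baum_welch_stats(features, state_ids, num_states):
    """Zeroth- and first-order statistics from a hard frame-to-state alignment.

    features:  (T, D) array of acoustic feature vectors
    state_ids: (T,) integer array with the state assigned to each frame
    """
    features = np.asarray(features, dtype=float)
    state_ids = np.asarray(state_ids, dtype=int)
    # Zeroth order: number of frames aligned to each state.
    n = np.bincount(state_ids, minlength=num_states).astype(float)
    # First order: sum of the feature vectors aligned to each state.
    f = np.zeros((num_states, features.shape[1]))
    np.add.at(f, state_ids, features)
    return n, f

def per_digit_stats(features, digit_ids, state_ids, num_states):
    """Split frames by digit label (assumed output of the segmentation),
    then accumulate statistics separately for each digit."""
    features = np.asarray(features, dtype=float)
    digit_ids = np.asarray(digit_ids, dtype=int)
    state_ids = np.asarray(state_ids, dtype=int)
    stats = {}
    for digit in np.unique(digit_ids):
        mask = digit_ids == digit
        stats[int(digit)] = baum_welch_stats(features[mask], state_ids[mask], num_states)
    return stats
```

Each entry of the returned dictionary would then feed the i-vector extractor of the matching digit, which is what keeps every i-vector tied to a single digit's phonetic content.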

What carries the argument

Digit-specific HMMs that perform segmentation and state alignment, feeding per-digit i-vector extractors whose outputs receive uncertainty normalization.
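
For reference, the standard i-vector point estimate and its posterior covariance can be written per digit as below. The notation is introduced here for illustration and is not taken from the paper: for digit d, N_c and \tilde{f}_c are the zeroth-order and centered first-order statistics of HMM state c, T_c^{(d)} is the block of the digit-specific total variability matrix for that state, and \Sigma_c is the state covariance. How the proposed uncertainty normalization uses the posterior covariance is not stated in this summary, so only the standard quantities are shown.

```latex
\begin{align}
\mathbf{L}^{(d)} &= \mathbf{I} + \sum_{c} N_c \, \mathbf{T}_c^{(d)\top} \boldsymbol{\Sigma}_c^{-1} \mathbf{T}_c^{(d)}, \\
\mathbf{w}^{(d)} &= \bigl(\mathbf{L}^{(d)}\bigr)^{-1} \sum_{c} \mathbf{T}_c^{(d)\top} \boldsymbol{\Sigma}_c^{-1} \tilde{\mathbf{f}}_c, \\
\operatorname{Cov}\bigl(\mathbf{w}^{(d)}\bigr) &= \bigl(\mathbf{L}^{(d)}\bigr)^{-1}.
\end{align}
```

A digit spoken for only a few frames has small N_c, so the precision L stays close to the identity and the posterior covariance stays large. That is the kind of estimate variability an uncertainty-aware normalization has to handle.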

If this is right

  • Omission of channel compensation produces only minor performance loss, so the method does not require multiple handsets per speaker.
  • The same pipeline applied to phrases on the RedDots corpus yields comparable gains over baselines.
  • Fusion of the spectral i-vectors with bottleneck features produces additional error reduction.
  • State-of-the-art results are obtained with a single system and simple cosine scoring rather than complex back-ends.
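
The "simple cosine scoring" in the last point can be illustrated with a short sketch. The summary does not say which score normalization the paper applies, so symmetric normalization (s-norm) against a cohort of impostor vectors is assumed here; the function names and the `cohort` argument are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def snorm_cosine(enroll, test, cohort):
    """Cosine score with symmetric normalization (s-norm, an assumption).

    enroll, test: 1-D i-vectors for the enrollment and test sides
    cohort:       (K, D) array of impostor i-vectors
    """
    raw = cosine(enroll, test)
    # Score each side of the trial against the same impostor cohort.
    e_scores = np.array([cosine(enroll, c) for c in cohort])
    t_scores = np.array([cosine(test, c) for c in cohort])
    # Standardize the raw score with each side's cohort statistics, then average.
    z_enroll = (raw - e_scores.mean()) / e_scores.std()
    z_test = (raw - t_scores.mean()) / t_scores.std()
    return 0.5 * (z_enroll + z_test)
```

In a per-digit system, one plausible use would be to score each digit's i-vector pair this way and combine the per-digit scores; how the paper combines them is not described in this summary.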

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The per-digit localization may reduce sensitivity to phonetic mismatch in text-dependent tasks beyond digits.
  • Uncertainty normalization could be tested on other embedding extractors to check whether the gain is specific to i-vectors.
  • Because the method needs little channel diversity, it may suit deployment scenarios where only single-device enrollment data is available.

Load-bearing premise

Digit-specific HMMs trained on the same corpus can reliably segment random-digit utterances and produce frame alignments accurate enough for the per-digit i-vector extractors to remain well-localized.

What would settle it

Replace the HMM-derived alignments with random or cross-digit alignments and measure whether the reported EER on RSR2015 part III rises above 2.5% for both genders.
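
Checking the outcome of such an ablation only requires the target and non-target trial scores. The sketch below computes the EER by a plain threshold sweep; it is an illustrative implementation, not the scoring tool used in the paper, and the function name and the toy scores are hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER from a threshold sweep over all observed scores.

    A trial is accepted when its score is >= the threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials rejected at each threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials accepted at each threshold.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    # Take the operating point where the two error rates are closest.
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])

# Toy scores, only to show the call and the 2.5% comparison.
eer = equal_error_rate([0.9, 0.8, 0.4], [0.5, 0.2, 0.1])
print(f"EER = {100 * eer:.2f}%, above 2.5%: {eer > 0.025}")
```

The sweep is quadratic in the number of trials, which is acceptable for a sketch but would be replaced by a sorted cumulative count on a full trial list.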

Figures

Figures reproduced from arXiv: 1907.06111 by Hossein Sameti, Hossein Zeinali, Nooshin Maghsoodi, Themos Stafylakis.

Figure 1: Block diagram of the proposed system during enrollment phase.
Figure 2: DET curves for the proposed methods for female speakers. The trends ...

Original abstract

In this paper, we combine Hidden Markov Models (HMMs) with i-vector extractors to address the problem of text-dependent speaker recognition with random digit strings. We employ digit-specific HMMs to segment the utterances into digits, to perform frame alignment to HMM states and to extract Baum-Welch statistics. By making use of the natural partition of input features into digits, we train digit-specific i-vector extractors on top of each HMM and we extract well-localized i-vectors, each modelling merely the phonetic content corresponding to a single digit. We then examine ways to perform channel and uncertainty compensation, and we propose a novel method for using the uncertainty in the i-vector estimates. The experiments on RSR2015 part III show that the proposed method attains 1.52% and 1.77% Equal Error Rate (EER) for male and female respectively, outperforming state-of-the-art methods such as x-vectors, trained on vast amounts of data. Furthermore, these results are attained by a single system trained entirely on RSR2015, and by a simple score-normalized cosine distance. Moreover, we show that the omission of channel compensation yields only a minor degradation in performance, meaning that the system attains state-of-the-art results even without recordings from multiple handsets per speaker for training or enrolment. Similar conclusions are drawn from our experiments on the RedDots corpus, where the same method is evaluated on phrases. Finally, we report results with bottleneck features and show that further improvement is attained when fusing them with spectral features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes combining digit-specific HMMs with i-vector extractors for text-dependent speaker recognition on random digit strings. Digit-specific HMMs segment utterances, provide state alignments, and accumulate Baum-Welch statistics for training per-digit i-vector extractors that produce localized representations; a novel uncertainty normalization is introduced, followed by score-normalized cosine scoring. On RSR2015 part III the system reports 1.52% EER (male) and 1.77% EER (female), outperforming x-vectors trained on much larger data; similar conclusions are drawn on RedDots phrases, with only minor degradation when channel compensation is omitted and further gains when fusing bottleneck features.

Significance. If the central claims hold, the work demonstrates that a compact, single-system pipeline trained exclusively on RSR2015 can surpass data-intensive x-vector baselines while remaining robust to the absence of multi-handset channel data. The explicit use of phonetic partitioning via HMMs and the uncertainty-handling technique constitute concrete, falsifiable contributions that could influence practical text-dependent systems. The public-corpus evaluation protocol and the reported minor impact of channel compensation are reproducible strengths.

major comments (2)
  1. [Abstract and §3] The claim of i-vectors 'each modelling merely the phonetic content corresponding to a single digit' requires that digit-specific HMM alignments remain accurate under co-articulation and speaker variation. No alignment-error metrics, forced-alignment comparisons against reference transcriptions, or ablation removing the HMM segmentation step are reported; if alignments are noisy, the subsequent localization and uncertainty normalization rest on an untested premise.
  2. [Abstract] The reported EERs of 1.52% (male) and 1.77% (female) are presented as outperforming x-vectors without accompanying error bars, confidence intervals, or statistical significance tests across multiple training seeds or folds, which weakens the outperformance claim.
minor comments (2)
  1. [Abstract] The abstract states that 'the omission of channel compensation yields only a minor degradation' but does not quantify the exact EER increase or identify the table/figure containing the comparison.
  2. [Abstract] Notation for the uncertainty normalization procedure is introduced without an explicit equation reference in the abstract; readers must wait until the methods section to locate the precise formulation.
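
The error bars requested in major comment 2 could be approximated without retraining by resampling the trial scores. The sketch below is one hedged way to do that with a percentile bootstrap; it is not something the paper reports, and the function names, the number of resamples, and the seed are all illustrative choices. Trials that share speakers are not independent, so a trial-level bootstrap like this one tends to understate the true variability.

```python
import numpy as np

def eer(target, nontarget):
    """EER via sorted counts; a trial is accepted when score >= threshold."""
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Number of scores strictly below each threshold, as a fraction.
    miss = np.searchsorted(np.sort(target), thresholds, side="left") / target.size
    below = np.searchsorted(np.sort(nontarget), thresholds, side="left") / nontarget.size
    false_alarm = 1.0 - below
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])

def bootstrap_eer_interval(target, nontarget, num_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the EER (trial-level resampling)."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    nontarget = np.asarray(nontarget, dtype=float)
    samples = np.empty(num_resamples)
    for b in range(num_resamples):
        # Resample target and non-target trials separately, with replacement.
        t = rng.choice(target, size=target.size, replace=True)
        n = rng.choice(nontarget, size=nontarget.size, replace=True)
        samples[b] = eer(t, n)
    alpha = (1.0 - level) / 2.0
    return float(np.quantile(samples, alpha)), float(np.quantile(samples, 1.0 - alpha))
```

Comparing the intervals of two systems on the same trial list gives a rough sense of whether a gap such as the one claimed over x-vectors exceeds resampling noise; a speaker-level resampling scheme would be the more conservative variant.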

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, indicating planned revisions to strengthen the manuscript where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §3] The claim of i-vectors 'each modelling merely the phonetic content corresponding to a single digit' requires that digit-specific HMM alignments remain accurate under co-articulation and speaker variation. No alignment-error metrics, forced-alignment comparisons against reference transcriptions, or ablation removing the HMM segmentation step are reported; if alignments are noisy, the subsequent localization and uncertainty normalization rest on an untested premise.

    Authors: We agree that explicit validation of alignment accuracy would strengthen the localization premise. The digit-specific HMMs are trained supervised on RSR2015 using the provided transcriptions, following standard practice for text-dependent tasks. The uncertainty normalization is designed to account for estimation variability that may include minor alignment effects. In the revised manuscript we will expand the discussion in §3 to address alignment robustness under co-articulation and speaker variation, and we will add a qualitative analysis of alignment stability on a subset of utterances.
    Revision: partial

  2. Referee: [Abstract] The reported EERs of 1.52% (male) and 1.77% (female) are presented as outperforming x-vectors without accompanying error bars, confidence intervals, or statistical significance tests across multiple training seeds or folds, which weakens the outperformance claim.

    Authors: We acknowledge that the lack of error bars or significance tests weakens the quantitative strength of the outperformance statement. The reported figures follow the fixed, single-run protocol defined for RSR2015 part III; multiple random seeds were not explored due to computational cost. In the revised version we will qualify the abstract and results sections to note that the EERs are obtained from the standard single-run evaluation on this corpus and that the margin over the x-vector baseline is substantial.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's pipeline (digit-specific HMMs for segmentation and Baum-Welch statistics feeding per-digit i-vector extractors, followed by uncertainty normalization and cosine scoring) relies on standard, externally established techniques. None of the reported EERs or claims reduces by construction to fitted parameters, and the argument does not depend on self-citation chains or a smuggled-in ansatz. No equations or steps in the provided text equate outputs to inputs via self-definition or renaming; results are presented as empirical outcomes on RSR2015 and RedDots, independent of the method's internal assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the method implicitly relies on standard HMM and i-vector assumptions plus the unstated claim that digit boundaries can be recovered accurately enough from the same limited corpus. No explicit free parameters or invented entities are named.

axioms (1)
  • Domain assumption: Digit-specific HMMs trained on RSR2015 produce reliable state alignments for random digit strings.
    Invoked in the description of segmentation and Baum-Welch statistics extraction.


