Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors

Hossein Sameti; Hossein Zeinali; Nooshin Maghsoodi; Themos Stafylakis

arxiv: 1907.06111 · v1 · pith:SB6GPRJZnew · submitted 2019-07-13 · 📡 eess.AS · cs.CL · cs.SD

Pith reviewed 2026-05-24 21:49 UTC · model grok-4.3

classification 📡 eess.AS cs.CL cs.SD

keywords: speaker recognition, i-vectors, HMM, text-dependent verification, random digits, uncertainty normalization, RSR2015, RedDots

The pith

Digit-specific HMMs enable per-digit i-vectors that reach 1.52% (male) and 1.77% (female) EER on RSR2015 part III random-digit speaker verification, using RSR2015 as the only training corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hidden Markov models tied to individual digits can segment random-digit strings, align frames to states, and feed localized statistics into separate i-vector extractors for each digit. This produces i-vectors that model only the phonetic content of a single digit rather than mixing across an utterance. A new uncertainty normalization step is introduced to handle variability in those estimates, and the resulting system is scored with plain cosine distance after simple normalization. The approach yields lower error rates than x-vector systems trained on far larger datasets while requiring no multi-handset recordings per speaker.

Core claim

Digit-specific HMMs segment utterances into digits and supply frame alignments for extracting Baum-Welch statistics; digit-specific i-vector extractors are then trained on those statistics so each i-vector models only one digit's phonetic content; uncertainty in the i-vector estimates is normalized before scoring; on RSR2015 part III this single system trained only on that corpus attains 1.52% EER for males and 1.77% EER for females using score-normalized cosine distance, outperforming x-vectors and showing only minor loss when channel compensation is omitted.
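
The scoring step named in the core claim can be sketched as follows. This is a minimal illustration, assuming symmetric score normalization (s-norm) against a cohort of impostor vectors; the source only says "score-normalized cosine distance", so the choice of s-norm, the function names `cosine` and `s_norm`, and the toy vectors are assumptions rather than the authors' implementation.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def s_norm(enroll, test, cohort):
    """Symmetric score normalization of a cosine score (assumed variant).

    The raw score is z-scored once against the enrollment vector's cohort
    scores and once against the test vector's cohort scores, and the two
    normalized scores are averaged.
    """
    raw = cosine(enroll, test)
    e_scores = [cosine(enroll, c) for c in cohort]
    t_scores = [cosine(test, c) for c in cohort]
    z_e = (raw - mean(e_scores)) / pstdev(e_scores)
    z_t = (raw - mean(t_scores)) / pstdev(t_scores)
    return 0.5 * (z_e + z_t)


# Toy data: hypothetical 3-dimensional vectors standing in for i-vectors.
cohort = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.5, -0.5, 0.2]]
enroll = [1.0, 0.2, 0.0]
same_speaker = [0.9, 0.3, 0.1]
impostor = [0.0, 0.2, 1.0]
target_score = s_norm(enroll, same_speaker, cohort)
impostor_score = s_norm(enroll, impostor, cohort)
```

In a per-digit system, one such score would presumably be computed for each digit shared by the enrollment and test utterances and then combined; how the paper combines them is not stated in the abstract.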

What carries the argument

Digit-specific HMMs that perform segmentation and state alignment, feeding per-digit i-vector extractors whose outputs receive uncertainty normalization.
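
As a rough illustration of how alignments turn into localized statistics, the sketch below accumulates zero-order (frame count) and first-order (feature sum) statistics per digit and HMM state. It assumes a hard, Viterbi-style alignment and a single Gaussian per state; the function name `accumulate_stats` and the data layout are hypothetical, and the paper's actual statistics may use soft posteriors and mixture components.

```python
from collections import defaultdict


def accumulate_stats(frames, alignment):
    """Accumulate per-digit zero- and first-order statistics.

    frames:    list of feature vectors (lists of floats), one per frame
    alignment: list of (digit, state) pairs, one per frame, e.g. taken from
               a Viterbi pass through the digit HMMs (hard alignment is an
               assumption of this sketch)
    Returns {digit: {state: (frame_count, summed_features)}}.
    """
    dim = len(frames[0])
    stats = defaultdict(dict)
    for x, (digit, state) in zip(frames, alignment):
        count, total = stats[digit].get(state, (0, [0.0] * dim))
        stats[digit][state] = (count + 1, [t + v for t, v in zip(total, x)])
    return {digit: dict(states) for digit, states in stats.items()}


# Three 2-dimensional frames: two aligned to digit "one", one to digit "two".
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alignment = [("one", 0), ("one", 0), ("two", 1)]
stats = accumulate_stats(frames, alignment)
```

Because the statistics are keyed by digit, each digit-specific extractor would only ever see frames assigned to its own digit, which is the localization the argument depends on.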

If this is right

Omission of channel compensation produces only minor performance loss, so the method does not require multiple handsets per speaker.
The same pipeline applied to phrases on the RedDots corpus yields comparable gains over baselines.
Fusion of the spectral i-vectors with bottleneck features produces additional error reduction.
State-of-the-art results are obtained with a single system and simple cosine scoring rather than complex back-ends.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-digit localization may reduce sensitivity to phonetic mismatch in text-dependent tasks beyond digits.
Uncertainty normalization could be tested on other embedding extractors to check whether the gain is specific to i-vectors.
Because the method needs little channel diversity, it may suit deployment scenarios where only single-device enrollment data is available.

Load-bearing premise

Digit-specific HMMs trained on the same corpus can reliably segment random-digit utterances and produce frame alignments accurate enough for the per-digit i-vector extractors to remain well-localized.

What would settle it

Replace the HMM-derived alignments with random or cross-digit alignments and measure whether the reported EER on RSR2015 part III rises above 2.5% for both genders.
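
Running that test requires an EER measurement for each alignment condition. The sketch below computes EER from lists of target and non-target scores with a plain threshold sweep; the function name `equal_error_rate` and the toy scores are illustrative assumptions, and evaluation toolkits typically use faster, interpolated variants.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER (as a fraction) by sweeping thresholds over all scores.

    At each candidate threshold, a trial is accepted when score >= threshold.
    The EER is taken where the false-rejection and false-acceptance rates
    are closest, and reported as their average.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        frr = sum(s < th for s in target_scores) / len(target_scores)
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), 0.5 * (frr + far)
    return eer


# Toy scores only; real trials would come from the RSR2015 part III protocol.
separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.4, 0.6])
exceeds_bar = overlapping > 0.025  # the 2.5% bar proposed above
```

The function would be run once on scores from HMM-derived alignments and once on scores from random or cross-digit alignments, with the second result compared against 0.025 for each gender.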

Figures

Figures reproduced from arXiv: 1907.06111 by Hossein Sameti, Hossein Zeinali, Nooshin Maghsoodi, Themos Stafylakis.

**Figure 2.** DET curves for the proposed methods for female speakers. The trends …

Original abstract

In this paper, we combine Hidden Markov Models (HMMs) with i-vector extractors to address the problem of text-dependent speaker recognition with random digit strings. We employ digit-specific HMMs to segment the utterances into digits, to perform frame alignment to HMM states and to extract Baum-Welch statistics. By making use of the natural partition of input features into digits, we train digit-specific i-vector extractors on top of each HMM and we extract well-localized i-vectors, each modelling merely the phonetic content corresponding to a single digit. We then examine ways to perform channel and uncertainty compensation, and we propose a novel method for using the uncertainty in the i-vector estimates. The experiments on RSR2015 part III show that the proposed method attains 1.52% and 1.77% Equal Error Rate (EER) for male and female respectively, outperforming state-of-the-art methods such as x-vectors, trained on vast amounts of data. Furthermore, these results are attained by a single system trained entirely on RSR2015, and by a simple score-normalized cosine distance. Moreover, we show that the omission of channel compensation yields only a minor degradation in performance, meaning that the system attains state-of-the-art results even without recordings from multiple handsets per speaker for training or enrolment. Similar conclusions are drawn from our experiments on the RedDots corpus, where the same method is evaluated on phrases. Finally, we report results with bottleneck features and show that further improvement is attained when fusing them with spectral features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets low EERs on random-digit verification by splitting i-vectors per digit via HMM alignment plus uncertainty normalization, but the alignment accuracy is assumed without direct checks.

Here's the quick read on that speaker recognition paper with the random digits. They train digit-specific HMMs on RSR2015 to cut the utterances into digits, align frames to states, and pull out Baum-Welch stats. Then they build a separate i-vector extractor for each digit so each one only sees the content from one digit. After that they add uncertainty normalization on the i-vectors. On the test set this gets them down to 1.52% EER for men and 1.77% for women, which beats x-vectors even though everything is trained on the same small corpus and they just use cosine scoring after score normalization. Skipping channel compensation barely moves the needle, which is handy if you don't have multi-handset recordings. They get similar patterns on RedDots phrases too.

The actual novelty is that per-digit extractor setup plus the uncertainty handling. It looks like a reasonable way to localize the modeling without needing tons of data.

The part that feels thin is the assumption that the HMM segmentation is good enough. The abstract lays out the steps but never shows how accurate the digit boundaries or state alignments are, or what happens if you skip the HMM and just use something else. Co-articulation could blur things, and without an ablation or error measure it's hard to know how much the localization is really helping. The EER numbers also come without any error bars or details on whether tuning touched the test set.

This is squarely for the text-dependent speaker verification crowd in speech processing. A colleague working on short-phrase or digit-string tasks would pick up the pipeline and the uncertainty trick. It has real experiments on public sets and a working system, so it should go to referees rather than get desk-rejected. They can sort out the alignment question during review.

Referee Report

2 major / 2 minor

Summary. The paper proposes combining digit-specific HMMs with i-vector extractors for text-dependent speaker recognition on random digit strings. Digit-specific HMMs segment utterances, provide state alignments, and accumulate Baum-Welch statistics for training per-digit i-vector extractors that produce localized representations; a novel uncertainty normalization is introduced, followed by score-normalized cosine scoring. On RSR2015 part III the system reports 1.52% EER (male) and 1.77% EER (female), outperforming x-vectors trained on much larger data; similar conclusions are drawn on RedDots phrases, with only minor degradation when channel compensation is omitted and further gains when fusing bottleneck features.

Significance. If the central claims hold, the work demonstrates that a compact, single-system pipeline trained exclusively on RSR2015 can surpass data-intensive x-vector baselines while remaining robust to the absence of multi-handset channel data. The explicit use of phonetic partitioning via HMMs and the uncertainty-handling technique constitute concrete, falsifiable contributions that could influence practical text-dependent systems. The public-corpus evaluation protocol and the reported minor impact of channel compensation are reproducible strengths.

major comments (2)

[Abstract and §3] The claim that each extracted i-vector models 'merely the phonetic content corresponding to a single digit' requires that digit-specific HMM alignments remain accurate under co-articulation and speaker variation. No alignment-error metrics, forced-alignment comparisons against reference transcriptions, or ablations removing the HMM segmentation step are reported; if alignments are noisy, the subsequent localization and uncertainty normalization rest on an untested premise.
[Abstract] The reported EERs of 1.52% (male) and 1.77% (female) are presented as outperforming x-vectors without accompanying error bars, confidence intervals, or statistical significance tests across multiple training seeds or folds, which weakens the outperformance claim.
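
One inexpensive way to attach an interval to a single-run EER is a percentile bootstrap over trials. The sketch below is an illustration under stated assumptions, not a procedure taken from the paper: the names `eer` and `bootstrap_eer_interval` are hypothetical, the scores are synthetic, and trials are resampled as if independent even though trials sharing a speaker are correlated in practice.

```python
import random


def eer(targets, nontargets):
    """EER as a fraction, from a threshold sweep over the observed scores."""
    best_gap, best = float("inf"), 1.0
    for th in sorted(set(targets) | set(nontargets)):
        frr = sum(s < th for s in targets) / len(targets)
        far = sum(s >= th for s in nontargets) / len(nontargets)
        if abs(frr - far) < best_gap:
            best_gap, best = abs(frr - far), 0.5 * (frr + far)
    return best


def bootstrap_eer_interval(targets, nontargets, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER.

    Target and non-target trials are resampled with replacement, the EER is
    recomputed on each resample, and the alpha/2 and 1 - alpha/2 percentiles
    of those estimates are returned as (lower, upper).
    """
    rng = random.Random(seed)
    estimates = sorted(
        eer(rng.choices(targets, k=len(targets)),
            rng.choices(nontargets, k=len(nontargets)))
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Synthetic scores standing in for real verification trials.
demo_rng = random.Random(1)
demo_targets = [demo_rng.gauss(2.0, 1.0) for _ in range(60)]
demo_nontargets = [demo_rng.gauss(0.0, 1.0) for _ in range(60)]
interval = bootstrap_eer_interval(demo_targets, demo_nontargets, n_boot=50)
```

A speaker-level bootstrap, resampling speakers rather than trials, would be the more conservative choice and would likely widen the interval.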

minor comments (2)

[Abstract] The abstract states that 'the omission of channel compensation yields only a minor degradation' but does not quantify the exact EER increase or identify the table/figure containing the comparison.
[Abstract] Notation for the uncertainty normalization procedure is introduced without an explicit equation reference in the abstract; readers must wait until the methods section to locate the precise formulation.
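
For orientation on what "uncertainty in the i-vector estimates" refers to, the standard i-vector posterior from the total-variability model is reproduced below. This is the textbook formulation, not the paper's uncertainty normalization, which the abstract does not specify. The symbols are the usual ones and are assumptions with respect to the paper's own notation: $\mathbf{T}$ is the total-variability matrix, $\boldsymbol{\Sigma}$ the block-diagonal covariance of the underlying Gaussians, $\mathbf{N}$ the block-diagonal matrix of zero-order statistics, and $\mathbf{f}$ the supervector of centered first-order statistics for the segment $\mathcal{X}$.

```latex
\begin{align}
\mathbf{L} &= \mathbf{I} + \mathbf{T}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{N}\, \mathbf{T}, \\
\mathbb{E}[\mathbf{w} \mid \mathcal{X}] &= \mathbf{L}^{-1} \mathbf{T}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{f}, \\
\operatorname{Cov}[\mathbf{w} \mid \mathcal{X}] &= \mathbf{L}^{-1}.
\end{align}
```

The posterior covariance $\mathbf{L}^{-1}$ shrinks as the zero-order counts in $\mathbf{N}$ grow, so an i-vector estimated from the few frames of a single digit carries more uncertainty than one estimated from a whole utterance.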

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, indicating planned revisions to strengthen the manuscript where appropriate.

Referee: [Abstract and §3] The claim that each extracted i-vector models 'merely the phonetic content corresponding to a single digit' requires that digit-specific HMM alignments remain accurate under co-articulation and speaker variation. No alignment-error metrics, forced-alignment comparisons against reference transcriptions, or ablations removing the HMM segmentation step are reported; if alignments are noisy, the subsequent localization and uncertainty normalization rest on an untested premise.

Authors: We agree that explicit validation of alignment accuracy would strengthen the localization premise. The digit-specific HMMs are trained in a supervised manner on RSR2015 using the provided transcriptions, following standard practice for text-dependent tasks. The uncertainty normalization is designed to account for estimation variability that may include minor alignment effects. In the revised manuscript we will expand the discussion in §3 to address alignment robustness under co-articulation and speaker variation, and we will add a qualitative analysis of alignment stability on a subset of utterances.
revision: partial

Referee: [Abstract] The reported EERs of 1.52% (male) and 1.77% (female) are presented as outperforming x-vectors without accompanying error bars, confidence intervals, or statistical significance tests across multiple training seeds or folds, which weakens the outperformance claim.

Authors: We acknowledge that the lack of error bars or significance tests weakens the quantitative strength of the outperformance statement. The reported figures follow the fixed, single-run protocol defined for RSR2015 part III; multiple random seeds were not explored due to computational cost. In the revised version we will qualify the abstract and results sections to note that the EERs are obtained from the standard single-run evaluation on this corpus and that the margin over the x-vector baseline is substantial.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper's pipeline (digit-specific HMMs for segmentation and Baum-Welch statistics feeding per-digit i-vector extractors, followed by uncertainty normalization and cosine scoring) relies on standard, externally established techniques without any reduction of reported EERs or claims to fitted parameters by construction, self-citation chains, or ansatz smuggling. No equations or steps in the provided text equate outputs to inputs via self-definition or renaming; results are presented as empirical outcomes on RSR2015 and RedDots, independent of the method's internal assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the method implicitly relies on standard HMM and i-vector assumptions plus the unstated claim that digit boundaries can be recovered accurately enough from the same limited corpus. No explicit free parameters or invented entities are named.

axioms (1)

Domain assumption: Digit-specific HMMs trained on RSR2015 produce reliable state alignments for random digit strings.
Invoked in the description of segmentation and Baum-Welch statistics extraction.

[1] [1]

Front-end factor analysis for speaker veriﬁcation,

N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker veriﬁcation,” IEEE Transactions on Audio, Speech, and Language Processing , vol. 19, no. 4, pp. 788–798, 2011

work page 2011

[2] [2]

Probabilistic linear discriminant analysis,

S. Ioffe, “Probabilistic linear discriminant analysis,” in Proc. Computer Vision–ECCV 2006. New York, NY , USA: Springer, 2006, pp. 531–542

work page 2006

[3] [3]

Well- calibrated heavy tailed Bayesian speaker veriﬁcation for microphone speech,

M. Senoussaoui, P. Kenny, P. Dumouchel, and F. Castaldo, “Well- calibrated heavy tailed Bayesian speaker veriﬁcation for microphone speech,” in Proc. ICASSP. IEEE, 2011, pp. 4824–4827

work page 2011

[4] [4]

The NIST speaker recognition evaluation–overview, methodology, sys- tems, results, perspective,

G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, “The NIST speaker recognition evaluation–overview, methodology, sys- tems, results, perspective,” Speech Communication , vol. 31, no. 2, pp. 225–254, 2000

work page 2000

[5] [5]

X- vectors: Robust dnn embeddings for speaker recognition,

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X- vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333

work page 2018

[6] [6]

Speaker recognition for multi-speaker conversations using x-vectors,

D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 5796–5800

work page 2019

[7] [7]

Text-dependent speaker recognition using PLDA with uncertainty propagation,

T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, “Text-dependent speaker recognition using PLDA with uncertainty propagation,” in Proc. Interspeech, 2013, pp. 3684–3688

work page 2013

[8] [8]

PLDA for speaker veriﬁcation with utterances of arbitrary duration,

P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel, “PLDA for speaker veriﬁcation with utterances of arbitrary duration,” Proc. ICASSP, pp. 7649–7653, 2013

work page 2013

[9] S. Cumani, O. Plchot, and P. Laface, “On the use of i-vector posterior distributions in probabilistic linear discriminant analysis,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 4, pp. 846–857, 2014

[10] N. Evans, M. Sahidullah, J. Yamagishi, M. Todisco, K. A. Lee, H. Delgado, T. Kinnunen et al., “The 2nd automatic speaker verification spoofing and countermeasures challenge (asvspoof 2017) database,” 2017

[11] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “Asvspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019

[12] A. Larcher, K. A. Lee, B. Ma, and H. Li, “Text-dependent speaker verification: Classifiers, databases and RSR2015,” Speech Communication, vol. 60, pp. 56–77, 2014

[13] A. Larcher, K. A. Lee, and B. Ma, “The RSR2015: Database for text-dependent speaker verification using multiple pass-phrases,” in Proc. Interspeech, 2012

[14] P. Kenny, T. Stafylakis, J. Alam, V. Gupta, and M. Kockmann, “Uncertainty modeling without subspace methods for text-dependent speaker recognition,” in Proc. Odyssey-The Speaker and Language Recognition Workshop, 2016, pp. 16–23

[15] H. Zeinali, L. Burget, H. Sameti, O. Glembek, and O. Plchot, “Deep neural networks and hidden Markov models in i-vector-based text-dependent speaker verification,” in Proc. Odyssey-The Speaker and Language Recognition Workshop, 2016, pp. 24–30

[16] H. Zeinali, H. Sameti, and L. Burget, “HMM-based phrase-independent i-vector extractor for text-dependent speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1421–1435, 2017

[17] H. Zeinali, E. Kalantari, H. Sameti, and H. Hadian, “Telephony text-prompted speaker verification using i-vector representation,” in Proc. ICASSP. IEEE, 2015, pp. 4839–4843

[18] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3573–3577, 2018

[19] Z. Huang, S. Wang, and K. Yu, “Angular softmax for short-duration text-independent speaker verification,” Proc. Interspeech 2018, pp. 3623–3627, 2018

[20] H. Aronowitz, “Text dependent speaker verification using a small development set,” in Proc. Odyssey-The Speaker and Language Recognition Workshop, 2012, pp. 312–316

[21] S. Novoselov, T. Pekhovsky, A. Shulipa, and A. Sholokhov, “Text-dependent GMM-JFA system for password based speaker verification,” in Proc. ICASSP. IEEE, 2014, pp. 729–737

[22] P. Kenny, T. Stafylakis, J. Alam, and M. Kockmann, “An i-vector backend for speaker verification,” in Proc. Interspeech, 2015, pp. 2307–2310

[23] T. Stafylakis, P. Kenny, J. Alam, and M. Kockmann, “JFA for Speaker Recognition with Random Digit Strings,” in Proc. Interspeech, 2015

[24] T. Stafylakis, J. Alam, and P. Kenny, “Text dependent speaker recognition with random digit strings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1194–1203, 2016

[25] W.-w. Lin, M.-W. Mak, and J.-T. Chien, “Fast scoring for plda with uncertainty propagation via i-vector grouping,” Computer Speech & Language, vol. 45, pp. 503–515, 2017

[26] W. Lin and M.-W. Mak, “Fast scoring for plda with uncertainty propagation,” in Proc. Odyssey-The Speaker and Language Recognition Workshop, 2016, pp. 31–38

[27] D. Ribas, E. Vincent, and J. R. Calvo, “Uncertainty propagation for noise robust speaker recognition: the case of NIST-SRE,” in Proc. Interspeech, 2015

[28] R. Saeidi and P. Alku, “Accounting for uncertainty of i-vectors in speaker recognition using uncertainty propagation and modified imputation,” in Proc. Interspeech, 2015

[29] R. Saeidi, R. Astudillo, and D. Kolossa, “Uncertain LDA: Including observation uncertainties in discriminative transforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 7, pp. 1479–1488, 2015

[30] T. Stafylakis, P. Kenny, M. J. Alam, and M. Kockmann, “Speaker and channel factors in text-dependent speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 65–78, 2016

[31] H. Zeinali, H. Sameti, and N. Maghsoodi, “SUT Submission for NIST 2016 Speaker Recognition Evaluation: Description and Analysis,” in Proc. ROCLING, 2017

[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP. IEEE, 2015, pp. 5206–5210

[33] J. Zhong, W. Hu, F. Soong, and H. Meng, “Dnn i-vector speaker verification with short, text-constrained test utterances,” in Proc. Interspeech, 2017, pp. 1507–1511

[34] M. Karafiát, F. Grézl, K. Veselý, M. Hannemann, I. Szőke, and J. Černocký, “But 2014 babel system: Analysis of adaptation in nn based systems,” in Proc. Interspeech, 2014

[35] H. Zeinali, H. Sameti, L. Burget et al., “Text-dependent speaker verification based on i-vectors, neural networks and hidden markov models,” Computer Speech & Language, vol. 46, pp. 53–71, 2017

[36] S. O. Sadjadi, M. Slaney, and L. Heck, “MSR identity toolbox v1.0: A MATLAB toolbox for speaker-recognition research,” Speech and Language Processing Technical Committee Newsletter, vol. 1, no. 4, 2013

[37] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Proc. Interspeech, 2011, pp. 249–252

[38] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The speakers in the wild (sitw) speaker recognition database,” in Interspeech, 2016, pp. 818–822

[39] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017

[40] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018

[41] J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, and O. Plchot, “Speaker verification using end-to-end adversarial language adaptation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6006–6010

[42] P. S. Nidadavolu, J. Villalba, and N. Dehak, “Cycle-gans for domain adaptation of acoustic features for speaker recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6206–6210

[43] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. ICASSP. IEEE, 2014, pp. 1695–1699

[44] K. A. Lee, A. Larcher, G. Wang, P. Kenny, N. Brümmer, D. v. Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma et al., “The reddots data collection for speaker recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015

[45] A. Lozano-Diez, A. Silnova, P. Matejka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and optimization of bottleneck features for speaker recognition,” in Proceedings of Odyssey, vol. 2016, 2016, pp. 352–357

[46] T. Fu, Y. Qian, Y. Liu, and K. Yu, “Tandem deep features for text-dependent speaker verification,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014

[47] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 171–178

[48] F. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, “Attention-based models for text-dependent speaker verification,” arXiv preprint arXiv:1710.10470, 2017

[49] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” arXiv preprint arXiv:1804.05160, 2018

[50] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, “End-to-end dnn based speaker recognition inspired by i-vector and plda,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4874–4878

[51] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1050–1059

Nooshin Maghsoodi received the B.Sc. degree in Computer Engineering from Sharif University of Technology and M.Sc. in Artificial Intelligence from...