Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors
Pith reviewed 2026-05-24 21:49 UTC · model grok-4.3
The pith
Digit-specific HMMs enable per-digit i-vectors that reach 1.52% EER on random-digit speaker verification using only one training corpus.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Digit-specific HMMs segment utterances into digits and supply frame alignments for extracting Baum-Welch statistics. Digit-specific i-vector extractors are then trained on those statistics, so each i-vector models only one digit's phonetic content. Uncertainty in the i-vector estimates is normalized before scoring. On RSR2015 part III, this single system, trained only on that corpus, attains 1.52% EER for males and 1.77% EER for females using score-normalized cosine distance, outperforming x-vectors and showing only minor loss when channel compensation is omitted.
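To make the scoring step concrete, here is a minimal sketch of a score-normalized cosine score. The paper's exact normalization is not stated on this page, so symmetric normalization (s-norm) against a cohort of impostor embeddings is used here as an assumption; the function names and the `cohort` array are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm_score(enroll, test, cohort):
    """Symmetric score normalization (s-norm) of a cosine score.

    Assumption: s-norm stands in for the paper's unspecified normalization.
    cohort: (n_cohort, dim) array of impostor embeddings (hypothetical set).
    """
    raw = cosine(enroll, test)
    # Scores of each side against the cohort give impostor statistics.
    e_scores = np.array([cosine(enroll, c) for c in cohort])
    t_scores = np.array([cosine(test, c) for c in cohort])
    z = (raw - e_scores.mean()) / e_scores.std()
    t = (raw - t_scores.mean()) / t_scores.std()
    return 0.5 * (z + t)
```

The normalized score expresses how far the trial score sits above each side's impostor distribution, which is why a single global threshold can work across speakers.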
What carries the argument
Digit-specific HMMs that perform segmentation and state alignment, feeding per-digit i-vector extractors whose outputs receive uncertainty normalization.
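As an illustration of what the HMM alignments feed into the extractors, the sketch below accumulates zero- and first-order Baum-Welch statistics for one digit segment. It assumes the per-frame occupation probabilities are already available from the digit's HMM; the function name and array layout are hypothetical, not taken from the paper.

```python
import numpy as np

def baum_welch_stats(frames, posteriors):
    """Zero- and first-order statistics for one digit segment.

    frames:     (T, D) acoustic features of the frames assigned to the digit.
    posteriors: (T, C) per-frame occupation probabilities over the C
                components (or states) of that digit's HMM (assumed given).
    """
    N = posteriors.sum(axis=0)   # (C,)   soft frame counts per component
    F = posteriors.T @ frames    # (C, D) posterior-weighted feature sums
    return N, F
```

With a hard alignment, each row of `posteriors` is one-hot, and the statistics reduce to per-state frame counts and feature sums.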
If this is right
- Omission of channel compensation produces only minor performance loss, so the method does not require multiple handsets per speaker.
- The same pipeline applied to phrases on the RedDots corpus leads to similar conclusions.
- Fusion of the spectral i-vectors with bottleneck features produces additional error reduction.
- State-of-the-art results are obtained with a single system and simple cosine scoring rather than complex back-ends.
Where Pith is reading between the lines
- The per-digit localization may reduce sensitivity to phonetic mismatch in text-dependent tasks beyond digits.
- Uncertainty normalization could be tested on other embedding extractors to check whether the gain is specific to i-vectors.
- Because the method needs little channel diversity, it may suit deployment scenarios where only single-device enrollment data is available.
Load-bearing premise
Digit-specific HMMs trained on the same corpus can reliably segment random-digit utterances and produce frame alignments accurate enough for the per-digit i-vector extractors to remain well-localized.
What would settle it
Replace the HMM-derived alignments with random or cross-digit alignments and measure whether the reported EER on RSR2015 part III rises above 2.5% for both genders.
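The comparison above depends on how the EER is computed. A minimal sketch follows, assuming target and impostor trial scores are available as arrays and that a trial is accepted when its score is at or above the threshold; the function name is hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER in percent: the operating point where miss rate and
    false-alarm rate are closest to equal."""
    target = np.sort(np.asarray(target_scores, dtype=float))
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))
    thresholds = np.unique(np.concatenate([target, impostor]))
    # A trial is accepted when score >= threshold.
    miss = np.searchsorted(target, thresholds, side="left") / target.size
    false_alarm = 1.0 - np.searchsorted(impostor, thresholds, side="left") / impostor.size
    i = np.argmin(np.abs(miss - false_alarm))
    return 100.0 * 0.5 * (miss[i] + false_alarm[i])
```

Running this once per gender on the scores from the perturbed-alignment system, and comparing each value with 2.5, would implement the check described above.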
Original abstract
In this paper, we combine Hidden Markov Models (HMMs) with i-vector extractors to address the problem of text-dependent speaker recognition with random digit strings. We employ digit-specific HMMs to segment the utterances into digits, to perform frame alignment to HMM states and to extract Baum-Welch statistics. By making use of the natural partition of input features into digits, we train digit-specific i-vector extractors on top of each HMM and we extract well-localized i-vectors, each modelling merely the phonetic content corresponding to a single digit. We then examine ways to perform channel and uncertainty compensation, and we propose a novel method for using the uncertainty in the i-vector estimates. The experiments on RSR2015 part III show that the proposed method attains 1.52% and 1.77% Equal Error Rate (EER) for male and female, respectively, outperforming state-of-the-art methods such as x-vectors, trained on vast amounts of data. Furthermore, these results are attained by a single system trained entirely on RSR2015, and by a simple score-normalized cosine distance. Moreover, we show that the omission of channel compensation yields only a minor degradation in performance, meaning that the system attains state-of-the-art results even without recordings from multiple handsets per speaker for training or enrolment. Similar conclusions are drawn from our experiments on the RedDots corpus, where the same method is evaluated on phrases. Finally, we report results with bottleneck features and show that further improvement is attained when fusing them with spectral features.
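For readers unfamiliar with where "the uncertainty in the i-vector estimates" comes from, the sketch below computes the standard i-vector posterior mean and covariance from Baum-Welch statistics under a standard-normal prior. It shows only the textbook posterior; the paper's own uncertainty normalization is not specified on this page and is not reproduced here. The function name and the per-component layout of `T` are assumptions.

```python
import numpy as np

def ivector_posterior(N, F, means, variances, T):
    """Posterior mean and covariance of the i-vector (standard formulation).

    N: (C,) zero-order stats; F: (C, D) first-order stats.
    means, variances: (C, D) diagonal-covariance component parameters.
    T: (C, D, R) total-variability matrix, split per component (assumed layout).
    """
    C, D, R = T.shape
    F_centered = F - N[:, None] * means        # remove component means
    T_scaled = T / variances[:, :, None]       # Sigma^{-1} T per component
    # Precision: I + sum_c N_c T_c^T Sigma_c^{-1} T_c
    precision = np.eye(R) + np.einsum("c,cdr,cds->rs", N, T_scaled, T)
    linear = np.einsum("cdr,cd->r", T_scaled, F_centered)
    cov = np.linalg.inv(precision)
    return cov @ linear, cov
```

Because the precision grows with the frame counts in `N`, a digit segment with few frames yields a larger posterior covariance, which is the uncertainty that a per-digit system has to contend with.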
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes combining digit-specific HMMs with i-vector extractors for text-dependent speaker recognition on random digit strings. Digit-specific HMMs segment utterances, provide state alignments, and accumulate Baum-Welch statistics for training per-digit i-vector extractors that produce localized representations; a novel uncertainty normalization is introduced, followed by score-normalized cosine scoring. On RSR2015 part III the system reports 1.52% EER (male) and 1.77% EER (female), outperforming x-vectors trained on much larger data; similar conclusions are drawn on RedDots phrases, with only minor degradation when channel compensation is omitted and further gains when fusing bottleneck features.
Significance. If the central claims hold, the work demonstrates that a compact, single-system pipeline trained exclusively on RSR2015 can surpass data-intensive x-vector baselines while remaining robust to the absence of multi-handset channel data. The explicit use of phonetic partitioning via HMMs and the uncertainty-handling technique constitute concrete, falsifiable contributions that could influence practical text-dependent systems. The public-corpus evaluation protocol and the reported minor impact of channel compensation are reproducible strengths.
major comments (2)
- [Abstract and §3] The claim that the extracted i-vectors are 'each modelling merely the phonetic content corresponding to a single digit' requires that digit-specific HMM alignments remain accurate under co-articulation and speaker variation. No alignment-error metrics, forced-alignment comparisons against reference transcriptions, or ablation removing the HMM segmentation step are reported; if alignments are noisy, the subsequent localization and uncertainty normalization rest on an untested premise.
- [Abstract] The reported EERs of 1.52% (male) and 1.77% (female) are presented as outperforming x-vectors without accompanying error bars, confidence intervals, or statistical significance tests across multiple training seeds or folds, which weakens the outperformance claim.
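One way the interval requested in the second comment could be obtained without retraining is a percentile bootstrap over trials, sketched below. This is an illustration under assumptions, not the paper's protocol: the function names are hypothetical, and resampling individual trials ignores the dependence between trials that share a speaker, so the resulting interval is likely optimistic.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER in percent at the threshold where miss and false-alarm rates are closest."""
    target = np.sort(np.asarray(target_scores, dtype=float))
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))
    thresholds = np.unique(np.concatenate([target, impostor]))
    miss = np.searchsorted(target, thresholds, side="left") / target.size
    false_alarm = 1.0 - np.searchsorted(impostor, thresholds, side="left") / impostor.size
    i = np.argmin(np.abs(miss - false_alarm))
    return 100.0 * 0.5 * (miss[i] + false_alarm[i])

def bootstrap_eer_ci(target, impostor, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER, resampling trials with replacement."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    eers = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(target, size=target.size, replace=True)
        i = rng.choice(impostor, size=impostor.size, replace=True)
        eers[b] = equal_error_rate(t, i)
    lo, hi = np.percentile(eers, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

A speaker-level bootstrap, which resamples speakers and keeps all of their trials together, would be the more conservative variant.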
minor comments (2)
- [Abstract] The abstract states that 'the omission of channel compensation yields only a minor degradation' but does not quantify the exact EER increase or identify the table/figure containing the comparison.
- [Abstract] Notation for the uncertainty normalization procedure is introduced without an explicit equation reference in the abstract; readers must wait until the methods section to locate the precise formulation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below, indicating planned revisions to strengthen the manuscript where appropriate.
- Referee: [Abstract and §3] The claim that the extracted i-vectors are 'each modelling merely the phonetic content corresponding to a single digit' requires that digit-specific HMM alignments remain accurate under co-articulation and speaker variation. No alignment-error metrics, forced-alignment comparisons against reference transcriptions, or ablation removing the HMM segmentation step are reported; if alignments are noisy, the subsequent localization and uncertainty normalization rest on an untested premise.
Authors: We agree that explicit validation of alignment accuracy would strengthen the localization premise. The digit-specific HMMs are trained in a supervised manner on RSR2015 using the provided transcriptions, following standard practice for text-dependent tasks. The uncertainty normalization is designed to account for estimation variability that may include minor alignment effects. In the revised manuscript we will expand the discussion in §3 to address alignment robustness under co-articulation and speaker variation, and we will add a qualitative analysis of alignment stability on a subset of utterances.
Revision: partial
- Referee: [Abstract] The reported EERs of 1.52% (male) and 1.77% (female) are presented as outperforming x-vectors without accompanying error bars, confidence intervals, or statistical significance tests across multiple training seeds or folds, which weakens the outperformance claim.
Authors: We acknowledge that the lack of error bars or significance tests weakens the quantitative strength of the outperformance statement. The reported figures follow the fixed, single-run protocol defined for RSR2015 part III; multiple random seeds were not explored due to computational cost. In the revised version we will qualify the abstract and results sections to note that the EERs are obtained from the standard single-run evaluation on this corpus and that the margin over the x-vector baseline is substantial.
Revision: partial
Circularity Check
No significant circularity; derivation is self-contained
The paper's pipeline (digit-specific HMMs for segmentation and Baum-Welch statistics feeding per-digit i-vector extractors, followed by uncertainty normalization and cosine scoring) relies on standard, externally established techniques without any reduction of reported EERs or claims to fitted parameters by construction, self-citation chains, or ansatz smuggling. No equations or steps in the provided text equate outputs to inputs via self-definition or renaming; results are presented as empirical outcomes on RSR2015 and RedDots, independent of the method's internal assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: digit-specific HMMs trained on RSR2015 produce reliable state alignments for random digit strings.