pith. machine review for the scientific record.

arxiv: 2605.12387 · v1 · submitted 2026-05-12 · 💻 cs.SD · cs.LG

Recognition: no theorem link

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

Adam Wynn, Jingyun Wang

Pith reviewed 2026-05-13 03:54 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords speech confidence detection · semi-supervised learning · Whisper model · eGeMAPS features · pseudo-labelling · prosodic analysis · audio embeddings · paralinguistic detection

The pith

Fusing Whisper embeddings with acoustic prosodic features in a semi-supervised setup improves speaker confidence detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to detect how confident a speaker sounds using limited labelled data. It combines rich semantic information from the Whisper speech model with simpler acoustic measurements like pitch variation and signs of stress. By carefully adding pseudo-labels from unlabelled audio only when the model is sure, the system reaches better accuracy than using either deep models or acoustic features alone. This matters because many applications need to sense speaker state but lack enough annotated examples. The results show that the added features help especially with harder cases like low-confidence speech.

Core claim

The hybrid semi-supervised framework fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. Using an Uncertainty-Aware Pseudo-Labelling strategy to select high-quality samples from unlabelled data, the approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines such as WavLM, HuBERT, and Wav2Vec 2.0, and improves minority-class performance by 3% over the unimodal Whisper baseline.
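
A minimal sketch of how Macro-F1 is computed may help in reading the headline number: because it is the unweighted mean of per-class F1 scores, errors on a small class pull it down as much as errors on a large one. The class names follow the Low, Medium and High categories named in Figure 8; the label lists and the helper name `macro_f1` are illustrative assumptions, not the paper's data or code.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels with "low" as the minority class (2 of 10 clips).
y_true = ["low", "low", "medium", "medium", "medium",
          "medium", "high", "high", "high", "high"]
y_pred = ["low", "medium", "medium", "medium", "medium",
          "high", "high", "high", "high", "high"]

score = macro_f1(y_true, y_pred, ["low", "medium", "high"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case a single missed low-confidence clip costs that class a third of its F1, which is why a reported minority-class gain can matter more than the overall accuracy suggests.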

What carries the argument

The Uncertainty-Aware Pseudo-Labelling strategy combined with fusion of Whisper encoder embeddings and eGeMAPS plus prosodic auxiliary features, which supplies corrective acoustic signals missing from deep semantic representations alone.
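
The filtering idea can be sketched as follows. This is not the paper's procedure: the text reviewed here does not define the uncertainty measure or its threshold, so the sketch assumes two common choices, the top softmax probability and the normalised entropy of the predicted distribution. The function names, the thresholds `min_prob` and `max_entropy`, and the clip probabilities are all hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def select_pseudo_labels(predictions, min_prob=0.9, max_entropy=0.4):
    """Keep (clip_id, class_index) pairs that pass both thresholds.

    min_prob and max_entropy are assumed values, not taken from the paper.
    """
    selected = []
    for clip_id, probs in predictions.items():
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= min_prob and normalized_entropy(probs) <= max_entropy:
            selected.append((clip_id, top))
    return selected


# Hypothetical class probabilities for three unlabelled clips.
predictions = {
    "clip_a": [0.95, 0.03, 0.02],  # confident: kept
    "clip_b": [0.40, 0.35, 0.25],  # ambiguous: dropped
    "clip_c": [0.05, 0.05, 0.90],  # confident: kept
}
kept = select_pseudo_labels(predictions)
print(kept)
```

A filter of this kind only measures how sure the model is, not whether it is right, which is the gap the referee report below presses on.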

If this is right

  • Explicit prosodic and auxiliary features correct for information lost in deep semantic representations.
  • High-quality curated pseudo-labels outperform indiscriminate large-scale data augmentation.
  • The hybrid model surpasses self-supervised audio models like WavLM, HuBERT, and Wav2Vec 2.0.
  • Data quality matters more than quantity for perceived confidence detection tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fusion approaches could improve detection of other speaker states such as emotion or engagement in low-data settings.
  • The emphasis on uncertainty in pseudo-labelling may help other speech classification tasks with subjective labels.
  • Testing the framework on real-world applications like virtual assistants could reveal practical benefits for adaptive responses.

Load-bearing premise

That the added acoustic features supply information not already captured in the Whisper embeddings and that the uncertainty-aware method picks unbiased high-quality pseudo-labels.

What would settle it

Running the same experiments on an independent dataset and observing no gain in Macro-F1 score or minority class performance from adding the acoustic features or the pseudo-labelling step.

Figures

Figures reproduced from arXiv: 2605.12387 by Adam Wynn, Jingyun Wang.

Figure 1: Overall Pipeline of the Confidence Classification System during Training for Fold k
Figure 2: User Interface of the Labelling System. As previously mentioned, to ensure rigorous evaluation without data leakage, the entire pipeline operates under a consistent stratified 5-fold cross-validation framework. The dataset D_L is partitioned into 5 folds once and these splits remain fixed throughout the experiment. The training folds used to build the pseudo-label model (D_L_TrainVal^(k)) are the exact sam…
Figure 3: Architecture of Auxiliary Models - Disfluency and Stress Classifiers
Figure 4: Architecture of MLP Labeller. … prior to Mel-spectrogram extraction. The combined dataset provided a diverse set of emotional speech instances across multiple speakers and recording conditions, enhancing the model's robustness. c) Model Architecture: The stress classification model follows a similar architecture to the disfluency model, using HuggingFace's WhisperForAudioClassification implementation to dis…
Figure 6: Confusion Matrix for Hybrid Confidence Model
Figure 7: t-SNE visualisation of test set embeddings using the ensemble model.
Figure 8: SHAP Feature Importance Analysis for Low, Medium and High Confidence.
Figure 9: Impact of Data Strategy. Comparing Ground-Truth (GT) Only against
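
The text captured alongside the Figure 2 caption describes a stratified 5-fold split that is computed once and then held fixed. A minimal sketch of that idea, using only the standard library, is given below. The helper `stratified_folds`, the seed, and the class counts are illustrative assumptions; the excerpt is truncated, so how the paper reuses the folds downstream is inferred rather than confirmed.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Map each sample index to a fold, keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed: the split is reproducible
    fold_of = {}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


# Made-up class counts for a small labelled set.
labels = ["low"] * 10 + ["medium"] * 20 + ["high"] * 15
fold_of = stratified_folds(labels)  # computed once, reused by every stage

fold_sizes = []
for k in range(5):
    test_idx = [i for i, f in fold_of.items() if f == k]
    train_idx = [i for i, f in fold_of.items() if f != k]
    # Assumed usage: both the pseudo-label model and the final classifier
    # for fold k are fit on train_idx only, so test_idx never leaks in.
    assert not set(train_idx) & set(test_idx)
    fold_sizes.append(len(test_idx))
print(fold_sizes)
```

Reusing one `fold_of` mapping for every stage is what prevents a clip from being seen by the pseudo-label model in one stage and held out in another.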
Original abstract

Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high confidence pseudo-labels outperforms indiscriminate large scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
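
One simple reading of the fusion described in the abstract is plain concatenation: the embedding, the acoustic descriptors, and the two auxiliary probabilities are joined into one input vector per clip. The sketch below assumes that reading, plus per-column standardisation of the acoustic features so their scales do not dominate; neither choice is confirmed by the text here. All dimensions and values are toy placeholders, and `zscore_columns` and `fuse` are hypothetical helper names.

```python
import statistics


def zscore_columns(rows):
    """Standardise each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)]
            for row in rows]


def fuse(embedding, acoustic, stress_prob, disfluency_prob):
    """Concatenate one clip's representations into a single vector."""
    return list(embedding) + list(acoustic) + [stress_prob, disfluency_prob]


# Toy stand-ins: 4-dim "embeddings", 3 acoustic descriptors, 2 aux probs.
embeddings = [[0.1, -0.2, 0.3, 0.0], [0.4, 0.1, -0.1, 0.2]]
acoustic_raw = [[120.0, 0.8, 35.0], [180.0, 0.5, 20.0]]
aux = [(0.7, 0.6), (0.2, 0.1)]  # (stress, disfluency) per clip

acoustic = zscore_columns(acoustic_raw)
fused = [fuse(e, a, s, d)
         for e, a, (s, d) in zip(embeddings, acoustic, aux)]
print(len(fused[0]), fused[0])
```

Keeping the acoustic and auxiliary values as separate, named positions in the vector is what makes a per-feature attribution analysis such as the SHAP plot in Figure 8 possible.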

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The paper proposes a semi-supervised hybrid framework for automatic detection of speaker confidence in speech. It fuses Whisper encoder embeddings with an interpretable acoustic feature vector (eGeMAPS descriptors plus auxiliary probabilities for vocal stress and disfluency). An Uncertainty-Aware Pseudo-Labelling strategy generates and filters pseudo-labels from unlabelled data, retaining only high-quality samples. The method reports a Macro-F1 of 0.751, outperforming self-supervised baselines (WavLM, HuBERT, Wav2Vec 2.0) and the unimodal Whisper baseline (with a 3% gain on the minority class). Ablations indicate that curated high-confidence pseudo-labels outperform indiscriminate large-scale augmentation.

Significance. If the performance claims hold after validation, the work would demonstrate a practical way to combine deep semantic representations with hand-crafted prosodic features for subjective paralinguistic tasks under label scarcity. The emphasis on filtering for pseudo-label quality rather than scale offers a transferable insight for semi-supervised speech classification.

major comments (2)
  1. [Uncertainty-Aware Pseudo-Labelling strategy (methods and experiments)] The headline Macro-F1 of 0.751 and the 3% minority-class improvement rest on the Uncertainty-Aware Pseudo-Labelling component. No held-out validation of pseudo-label accuracy against human ground truth (e.g., precision, recall, or confusion matrix on retained samples) is reported. Because confidence is a subjective trait with known annotator disagreement, the uncertainty filter may retain the model's own biased predictions rather than genuinely high-quality labels, undermining the claim that the hybrid acoustic features supply corrective signals absent from Whisper embeddings.
  2. [Ablation studies (experiments)] The ablation comparing curated high-confidence pseudo-labels to indiscriminate augmentation does not substitute for an external accuracy check on the pseudo-labels themselves. Without such a check, it remains possible that the reported gains arise from an altered training distribution rather than from the added prosodic features.
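
The external check requested in both major comments could be as small as the sketch below: annotate a sample of the retained clips by hand and compute per-class precision of the pseudo-labels on that sample. The clip identifiers, labels, and the helper `per_class_precision` are made up for illustration; no such audit data appears in the paper as reviewed.

```python
def per_class_precision(pseudo_labels, human_labels):
    """Per-class precision of retained pseudo-labels on an audited subset."""
    correct, total = {}, {}
    for clip_id, predicted in pseudo_labels.items():
        if clip_id not in human_labels:
            continue  # clip was not part of the human audit
        total[predicted] = total.get(predicted, 0) + 1
        if human_labels[clip_id] == predicted:
            correct[predicted] = correct.get(predicted, 0) + 1
    return {label: correct.get(label, 0) / count
            for label, count in total.items()}


# Hypothetical retained pseudo-labels and a human audit of five of them.
pseudo = {"u1": "high", "u2": "high", "u3": "low",
          "u4": "medium", "u5": "low", "u6": "high"}
audit = {"u1": "high", "u2": "medium", "u3": "low",
         "u4": "medium", "u5": "medium"}

precision = per_class_precision(pseudo, audit)
print(precision)
```

Low precision on one class in such an audit would indicate that the filter is retaining confident but wrong predictions for that class.
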
minor comments (4)
  1. [Abstract and experimental setup] The abstract and experimental sections provide no dataset name, size, train/test split, class distribution, or annotation protocol, making it impossible to assess the reliability of the reported Macro-F1 and minority-class results.
  2. [Experimental results] No statistical significance tests (e.g., McNemar or paired t-tests) are reported for the performance differences versus baselines or the unimodal Whisper model.
  3. [Methods] Implementation details are absent: the exact definition of uncertainty used for filtering, the threshold value, the proportion of unlabelled data retained, and the training hyperparameters for the hybrid model.
  4. [Feature extraction] The paper does not discuss how the auxiliary vocal-stress and disfluency probability estimates are obtained or whether they are derived from the same Whisper model or separate modules.
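
For the significance testing raised in minor comment 2, an exact McNemar test on paired predictions needs only the two discordant counts. The sketch below uses the standard two-sided exact binomial form; the prediction lists are made up and the function name `mcnemar_exact` is illustrative.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value): b counts items only model A classified
    correctly, c counts items only model B classified correctly.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)
    c = sum(1 for t, a, m in triples if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    # Under H0 each discordant item favours A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up paired predictions on ten test clips (true class is always 1).
y_true = [1] * 10
pred_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # e.g. a hybrid model
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # e.g. a unimodal baseline

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

With cross-validation, the test would be run on the pooled held-out predictions so that each clip is counted once.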

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validating the semi-supervised component, and we address each point below with planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Uncertainty-Aware Pseudo-Labelling strategy (methods and experiments)] The headline Macro-F1 of 0.751 and the 3% minority-class improvement rest on the Uncertainty-Aware Pseudo-Labelling component. No held-out validation of pseudo-label accuracy against human ground truth (e.g., precision, recall, or confusion matrix on retained samples) is reported. Because confidence is a subjective trait with known annotator disagreement, the uncertainty filter may retain the model's own biased predictions rather than genuinely high-quality labels, undermining the claim that the hybrid acoustic features supply corrective signals absent from Whisper embeddings.

    Authors: We agree that direct validation of pseudo-label accuracy against human annotations would provide stronger evidence for the quality of the retained samples. The unlabelled data lacks ground-truth labels by design, which is the core motivation for the semi-supervised setting; obtaining new human annotations for a held-out subset would require additional resources not available in the current study. The Uncertainty-Aware Pseudo-Labelling uses model uncertainty to filter samples, and the reported ablations demonstrate that high-confidence selection yields better performance than indiscriminate augmentation. In the revised manuscript, we will expand the methods section to detail the uncertainty estimation procedure and add a limitations discussion addressing potential biases arising from subjectivity. We will also include an analysis correlating uncertainty scores with downstream test-set performance as a proxy validation. revision: partial

  2. Referee: [Ablation studies (experiments)] The ablation comparing curated high-confidence pseudo-labels to indiscriminate augmentation does not substitute for an external accuracy check on the pseudo-labels themselves. Without such a check, it remains possible that the reported gains arise from an altered training distribution rather than from the added prosodic features.

    Authors: The ablation isolates the effect of pseudo-label curation by holding the model architecture, features, and training procedure fixed while varying only the selection strategy. The main results separately demonstrate the benefit of the hybrid acoustic features over the Whisper-only baseline under identical pseudo-labelling conditions. To clarify this distinction, we will revise the experiments section to add an explicit ablation that applies the same high-confidence pseudo-labelling regime both with and without the eGeMAPS plus auxiliary features, thereby showing that the performance lift from the hybrid component is not solely attributable to distribution shift. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external baselines and ablations

Full rationale

The paper reports an empirical Macro-F1 of 0.751 on (presumably held-out) test data, with direct comparisons to independent external models (WavLM, HuBERT, Wav2Vec 2.0) and an ablation contrasting curated high-confidence pseudo-labels versus indiscriminate augmentation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The Uncertainty-Aware Pseudo-Labelling strategy is presented as a methodological choice whose value is checked by ablation against quantity-based augmentation; the final performance numbers are not forced by construction from the training inputs themselves. This is a standard supervised/semi-supervised evaluation setup against external benchmarks, so the derivation chain does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework builds on standard components (Whisper encoder, eGeMAPS features) from prior literature without detailing new fitted quantities or assumptions.

pith-pipeline@v0.9.0 · 5495 in / 1085 out tokens · 133227 ms · 2026-05-13T03:54:20.220060+00:00 · methodology

