pith. machine review for the scientific record.

arxiv: 2605.12107 · v1 · submitted 2026-05-12 · 📡 eess.AS


Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement

Danilo de Oliveira, Tal Peer, Timo Gerkmann

Pith reviewed 2026-05-13 03:37 UTC · model grok-4.3

classification 📡 eess.AS
keywords speech enhancement · automatic speech recognition · word error rate · listening experiment · transducer model · human correlation · evaluation metrics · noise robustness

The pith

Modern ASR models trained on noisy data match human word error rates on enhanced speech more closely than older systems, yet their noise tolerance makes them less useful for judging acoustic enhancement quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether automatic speech recognition can serve as a reliable stand-in for human listeners when measuring how well speech enhancement improves intelligibility. A listening experiment directly compares human transcriptions of enhanced audio against outputs from several ASR systems. Modern models that saw large amounts of noisy training data and include language models produce error rates that track human performance better than simpler recognizers, and a transducer model yields the most consistent results. The same noise robustness and contextual processing that drive this improved match, however, also make these models overlook the specific acoustic changes that enhancement algorithms are designed to deliver.
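
As a concrete sketch of the comparison the paper performs, correlating per-condition ASR word error rates against human ones might look like the following. The arrays are hypothetical placeholders, not the paper's data, and whether the study reports Pearson, Spearman, or both is not stated in the material on this page.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-condition WERs (%); each entry scores one
# enhancement system / SNR condition. Not the paper's values.
human_wer = np.array([12.1, 18.4, 25.0, 31.7, 40.2])
asr_wer = np.array([10.9, 17.2, 26.3, 33.0, 43.5])  # e.g., one ASR model

r, p_r = pearsonr(human_wer, asr_wer)       # linear agreement
rho, p_rho = spearmanr(human_wer, asr_wer)  # agreement of the induced ranking
print(f"Pearson r={r:.3f} (p={p_r:.3g}) | Spearman rho={rho:.3f} (p={p_rho:.3g})")
```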

Core claim

A listening experiment establishes that modern automatic speech recognition systems trained on large-scale noisy data and equipped with language models correlate more strongly with human word error rates on enhanced speech than simpler ASR setups, with a transducer model providing the most reliable transcriptions. At the same time, their robustness to noise and their use of linguistic context render them uninformative for an acoustics-focused evaluation of enhancement performance.

What carries the argument

The listening experiment that gathers human transcriptions of enhanced speech signals and directly compares the resulting word error rates against those produced by multiple ASR architectures that differ in training data scale and internal language modeling.

If this is right

  • Evaluations of speech enhancement can adopt modern ASR systems as a more human-aligned automatic metric than older recognizers (a sketch of such a scoring loop follows this list).
  • Transducer architectures become a practical choice for generating transcriptions in such evaluations.
  • The choice of ASR system can change which enhancement algorithm appears superior, so results must be interpreted with the model's noise robustness in mind.
  • Acoustic-specific instrumental metrics remain necessary when the goal is to isolate signal-level improvements rather than overall intelligibility.
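
A minimal sketch of the first point: scoring one enhancement system's outputs with a modern ASR model plus a simple text normalizer. The checkpoint, file names, and normalizer below are illustrative assumptions, not the paper's exact setup.

```python
import re
import jiwer                       # pip install jiwer
from transformers import pipeline  # pip install transformers

# Example checkpoint only; the paper's exact models are not listed here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small.en")

def normalize(text: str) -> str:
    # Deliberately simple; real normalization pipelines shift WER (see abstract).
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def corpus_wer(wav_paths, reference_texts):
    refs, hyps = [], []
    for wav, ref in zip(wav_paths, reference_texts):
        hyps.append(normalize(asr(wav)["text"]))  # transcribe enhanced audio
        refs.append(normalize(ref))
    return jiwer.wer(refs, hyps)  # corpus-level WER across all utterances
```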

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Evaluation protocols could combine a modern ASR with an auxiliary acoustic probe that forces sensitivity to distortions the language model normally ignores (a toy composite metric is sketched after this list).
  • The same tension between robustness and task-specific sensitivity may appear when AI-based listeners are used to assess dereverberation or source separation.
  • For pure acoustic ranking, one might deliberately limit context use in ASR or train variants focused only on signal fidelity rather than full intelligibility.
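
To make the first speculative point concrete, a toy composite metric could blend ASR word accuracy with a signal-fidelity term such as SI-SDR. The weighting and normalization range below are arbitrary illustrative choices, not anything the paper proposes.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant SDR in dB, a standard signal-fidelity measure."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference  # scaled projection onto the clean signal
    noise = estimate - target   # everything the enhancer got wrong
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))

def composite_score(wacc: float, sisdr_db: float, weight: float = 0.5) -> float:
    """Blend word accuracy in [0, 1] with SI-SDR mapped to [0, 1], so a
    context-driven ASR cannot fully mask signal-level distortions."""
    acoustic = float(np.clip((sisdr_db + 10.0) / 40.0, 0.0, 1.0))  # -10..30 dB
    return weight * wacc + (1.0 - weight) * acoustic
```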

Load-bearing premise

The listening experiment with its chosen listeners, speech material, and enhancement conditions supplies a ground-truth measure of human recognition performance that holds beyond the tested conditions.

What would settle it

A new listening test with different speakers, noise types, or a different listener pool showing either that modern ASR models no longer produce word error rates closer to humans' than simpler models do, or that the transducer model is no longer the most reliable.

Figures

Figures reproduced from arXiv: 2605.12107 by Danilo de Oliveira, Tal Peer, Timo Gerkmann.

Figure 1
Figure 1: Decomposition of error sources for each ASR model, across all enhanced audio systems: substitution, deletion, and insertion rates, respectively, according to Equation 3, visualized across input SNR grouped in 2.5 dB bins. For Parakeet and Whisper, as well as humans, the enhancement of speech is counterproductive to speech recognition, as the W…
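
Equation 3 itself is not reproduced on this page; assuming the standard definitions (substitutions, deletions, and insertions each divided by the reference length N), the decomposition shown in Figure 1 can be computed from an edit-distance alignment, as in this sketch.

```python
def error_rates(ref_words, hyp_words):
    """Substitution, deletion, and insertion rates from a minimal
    edit-distance alignment; their sum is the WER."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                best = (c, s, d, n)          # match, no edit
            else:
                best = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            if c + 1 < best[0]:
                best = (c + 1, s, d + 1, n)  # deletion
            c, s, d, n = dp[i][j - 1]
            if c + 1 < best[0]:
                best = (c + 1, s, d, n + 1)  # insertion
            dp[i][j] = best
    _, S, D, I = dp[R][H]
    N = max(R, 1)
    return S / N, D / N, I / N

# error_rates("it is too good".split(), "it was too good to".split())
# -> (0.25, 0.0, 0.25): one substitution, one insertion over N = 4 words.
```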
Original abstract

Speech enhancement (SE) systems are typically evaluated using a variety of instrumental metrics. The use of automatic speech recognition (ASR) systems to evaluate SE performance is common in literature, usually in terms of word error rate (WER). However, WER scores depend heavily on the choice of ASR system and text normalization pipeline. In this paper, we investigate how modern ASR models correlate with human recognition of enhanced speech. A listening experiment reveals that modern ASR models with large-scale noisy training and embedded language models correlate more with human WER than simpler ones, with a transducer model providing the most reliable transcriptions. Nevertheless, we also show that these models' robustness to noise and use of context can be uninformative to an acoustics-focused evaluation of enhancement performance.
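
The abstract's point that WER depends on the text normalization pipeline is easy to demonstrate in a toy example; the strings and the normalizer here are illustrative, not from the paper.

```python
import re
import jiwer

ref = "It's too good to be true."
hyp = "its too good to be true"

print(jiwer.wer(ref, hyp))  # 0.333...: casing/punctuation count as errors

def norm(s: str) -> str:
    return re.sub(r"[^\w\s]", "", s.lower()).strip()

print(jiwer.wer(norm(ref), norm(hyp)))  # 0.0 after normalization
```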

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that modern ASR models trained on large-scale noisy data and incorporating language models correlate more strongly with human word error rates on enhanced speech than simpler ASR systems, with a transducer model providing the most reliable transcriptions. It further argues that the noise robustness and contextual capabilities of these advanced models can render them uninformative for purely acoustics-focused evaluation of speech enhancement performance.

Significance. If the correlation findings hold under scrutiny, the work offers a useful cautionary note for the speech enhancement community on the risks of using overly capable ASR systems as proxies for human perception, potentially leading to more careful selection of evaluation tools that better isolate acoustic improvements.

major comments (2)
  1. [Listening experiment and correlation results] The description of the listening experiment (referenced in the abstract and results) provides no information on the number of listeners, number of utterances, inter-listener agreement metrics, statistical tests for the reported correlations, error bars, or controls for speaker/condition variability. These details are load-bearing for the central claim that modern ASR models correlate better with human WER.
  2. [Results and discussion sections] The claim that the transducer model is 'the most reliable' and that robustness/context use can be 'uninformative' for acoustics-focused SE evaluation rests on the human WER ground truth; without reported sample sizes, significance testing, or analysis of generalization beyond the tested speaker pool and conditions, it is unclear whether the ranking and the follow-on warning would survive modest changes to the experimental setup.
minor comments (1)
  1. [Abstract] The abstract states the main findings but does not quantify the correlations or list the specific ASR models compared; adding one or two key numbers or model names would improve clarity without lengthening the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental reporting and the robustness of our claims. We address each major comment below and have updated the manuscript to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Listening experiment and correlation results] The description of the listening experiment (referenced in the abstract and results) provides no information on the number of listeners, number of utterances, inter-listener agreement metrics, statistical tests for the reported correlations, error bars, or controls for speaker/condition variability. These details are load-bearing for the central claim that modern ASR models correlate better with human WER.

    Authors: We agree that these methodological details are essential and should have been included. The original submission omitted a full description of the listening test protocol. In the revised manuscript we have added a new subsection under 'Experimental Setup' that reports the number of listeners, total utterances evaluated, inter-listener agreement (Fleiss' kappa), the statistical tests performed on the correlations (including p-values), error bars on all relevant figures, and the balancing procedures used to control for speaker and condition variability. These additions directly support the reported correlations without altering the underlying data or conclusions. revision: yes

  2. Referee: [Results and discussion sections] The claim that the transducer model is 'the most reliable' and that robustness/context use can be 'uninformative' for acoustics-focused SE evaluation rests on the human WER ground truth; without reported sample sizes, significance testing, or analysis of generalization beyond the tested speaker pool and conditions, it is unclear whether the ranking and the follow-on warning would survive modest changes to the experimental setup.

    Authors: We acknowledge the need for explicit sample-size reporting and significance testing in the results. The revised version now states the exact number of utterances and listeners underlying each correlation, includes p-values for all reported correlations, and adds a limitations paragraph in the discussion that notes the scope of the speaker pool and conditions tested. We maintain that the transducer model's superior alignment with human WER is supported by the data we collected; however, we have softened the language around generalizability to reflect that broader validation would be valuable in future work. The core cautionary message about overly robust ASR systems remains unchanged because it follows directly from the observed behavior on the evaluated data. revision: partial
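
The agreement and significance reporting described in the responses above might look like this minimal sketch. All numbers are hypothetical, and treating the listening data as per-utterance binary correctness judgments for Fleiss' kappa is an assumption, not something the rebuttal specifies.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical binary "utterance transcribed correctly" judgments:
# rows = utterances, columns = listeners.
ratings = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
])
table, _ = aggregate_raters(ratings)          # per-utterance category counts
print("Fleiss' kappa:", fleiss_kappa(table))  # inter-listener agreement

# Correlation with a p-value, as the revision reports for each ASR model.
human_wer = np.array([11.0, 19.5, 24.1, 30.8, 38.6])
model_wer = np.array([12.3, 18.0, 26.0, 33.1, 41.9])
r, p = pearsonr(human_wer, model_wer)
print(f"r = {r:.3f}, p = {p:.3g}")
```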

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential reductions

full rationale

The paper is an empirical investigation that collects human listening data on enhanced speech, computes WER for various ASR models, and reports correlations between them. No equations, parameter fittings, uniqueness theorems, or ansatzes are invoked; the central claims rest on direct experimental measurements rather than any derivation chain that could reduce to its own inputs by construction. Self-citations, if present, are not load-bearing for the reported rankings or conclusions, and the work does not rename known results or smuggle assumptions via prior author work. The study is therefore self-contained against external benchmarks with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human word error rate from a controlled listening test is the appropriate ground truth for evaluating ASR as a proxy for speech enhancement quality. No free parameters or invented entities are introduced; the work is purely comparative.

axioms (1)
  • domain assumption Human listeners provide an unbiased and generalizable measure of speech intelligibility under the test conditions used.
    Invoked implicitly when using human WER as the reference for ASR correlation.

pith-pipeline@v0.9.0 · 5429 in / 1250 out tokens · 70959 ms · 2026-05-13T03:37:47.003128+00:00 · methodology



Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1]

    traditional

    Introduction The task of speech enhancement (SE) generally aims at improving the quality and intelligibility of degraded speech recordings. There exists a multitude of instrumental metrics to evaluate SE performance without having to resort to human-generated scores, which are costly to collect, time- and effort-wise. As a way of avoiding biases towards...

  2. [2]

    Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement

    Method 2.1. Word error rate. WER is the most common metric to evaluate the performance of ASR systems. It is derived from the Levenshtein (edit) distance, which measures the difference between sequences [19]. More precisely, it measures the minimum number of edits required to change one sequence into another. WER is...

  3. [3]

    Results In order to facilitate the interpretation of correlations with other standard SE metrics where higher values are better, we compute the WAcc instead of the WER. Moreover, to mitigate the effect of catastrophic failures of Whisper while sticking to the typical aggregation procedure of averaging WER / WAcc scores, we clip WAcc by computing WAcc =...

  4. [4]

    This paper analyzes the behavior of different ASR models regarding their transcription accuracy scores and subsequent ranking of SE models

    Conclusion ASR-based evaluation metrics are often used to evaluate SE systems. This paper analyzes the behavior of different ASR models regarding their transcription accuracy scores and subsequent ranking of SE models. Models trained with large-scale, noisy data are the ones that best match the trends in human transcription capabilities, although in o...

  5. [5]

    Acknowledgments We would like to thank all the test subjects who agreed to take part in the listening experiment

  6. [6]

    Generative AI Use Disclosure No AI writing tools were used for the writing of this paper

  7. [7]

    The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement

    D. de Oliveira, S. Welker, J. Richter, and T. Gerkmann, “The PESQetarian: On the Relevance of Goodhart’s Law for Speech Enhancement,” in ISCA Interspeech, 2024, pp. 3854–3858.

  8. [8]

    ICASSP 2022 deep noise suppression challenge

    H. Dubey et al., “ICASSP 2022 deep noise suppression challenge,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2022, pp. 9271–9275.

  9. [9]

    Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs

    A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2001.

  10. [10]

    SDR - Half-baked or well done?

    J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR - Half-baked or well done?” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2019, pp. 626–630.

  11. [11]

    NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

    G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in ISCA Interspeech, 2021, pp. 2127–2131.

  12. [12]

    DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2022, pp. 886–890.

  13. [13]

    UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in ISCA Interspeech, 2022, pp. 4521–4525.

  14. [14]

    An algorithm for intelligibility prediction of time–frequency weighted noisy speech

    C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 19, no. 7, pp. 2125–2136, 2011.

  15. [15]

    An algorithm for predicting the intelligibility of speech masked by modulated noise maskers

    J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 24, no. 11, pp. 2009–2022, 2016.

  16. [16]

    ASR-based speech intelligibility prediction: A review

    M. Karbasi and D. Kolossa, “ASR-based speech intelligibility prediction: A review,” Hearing Research, vol. 426, p. 108606, 2022.

  17. [17]

    Matrix sentence intelligibility prediction using an automatic speech recognition system

    M. R. Schädler, A. Warzybok, S. Hochmuth, and B. Kollmeier, “Matrix sentence intelligibility prediction using an automatic speech recognition system,” Int. Journal of Audiology, vol. 54, no. sup2, pp. 100–107, 2015.

  18. [18]

    Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures

    M. Karbasi, S. Bleeck, and D. Kolossa, “Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures,” arXiv preprint arXiv:2010.08574, 2020.

  19. [19]

    Predicting intelligibility of enhanced speech using posteriors derived from DNN-based ASR system

    K. Arai, S. Araki, A. Ogawa, K. Kinoshita, T. Nakatani, and T. Irino, “Predicting intelligibility of enhanced speech using posteriors derived from DNN-based ASR system,” in ISCA Interspeech, 2020, pp. 1156–1160.

  20. [20]

    Using deep speech recognition to evaluate speech enhancement methods

    S. Siddiqui, G. Rasool, R. P. Ramachandran, and N. C. Bouaynaya, “Using deep speech recognition to evaluate speech enhancement methods,” in Int. Joint Conf. on Neural Networks (IJCNN), 2020, pp. 1–7.

  21. [21]

    Are these even words? Quantifying the gibberishness of generative speech models

    D. de Oliveira, T. Peer, J. Rochdi, and T. Gerkmann, “Are these even words? Quantifying the gibberishness of generative speech models,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2026, pp. 22472–22476.

  22. [22]

    How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR

    K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, “How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR,” in ISCA Interspeech, 2022, pp. 5418–5422.

  23. [23]

    Impact of residual noise and artifacts in speech enhancement errors on intelligibility of human and machine

    S. Araki, A. Yamamoto, T. Ochiai, K. Arai, A. Ogawa, T. Nakatani, and T. Irino, “Impact of residual noise and artifacts in speech enhancement errors on intelligibility of human and machine,” in ISCA Interspeech, 2023, pp. 2503–2507.

  24. [24]

    How does end-to-end speech recognition training impact speech enhancement artifacts?

    K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, “How does end-to-end speech recognition training impact speech enhancement artifacts?” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2024, pp. 11031–11035.

  25. [25]

    Binary codes capable of correcting deletions, insertions, and reversals

    V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.

  26. [26]

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Int. Conf. Machine Learning (ICML), 2006, pp. 369–376.

  27. [27]

    Sequence transduction with recurrent neural networks

    A. Graves, “Sequence transduction with recurrent neural networks,” Int. Conf. Machine Learning (ICML) Workshop on Representation Learning, 2012.

  28. [28]

    Attention-based models for speech recognition

    J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Adv. in Neural Information Proc. Sys., 2015, pp. 577–585.

  29. [29]

    Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

    W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2016, pp. 4960–4964.

  30. [30]

    End-to-end automatic speech translation of audiobooks

    A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, “End-to-end automatic speech translation of audiobooks,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2018, pp. 6224–6228.

  31. [31]

    Librispeech: An ASR corpus based on public domain audio books

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2015, pp. 5206–5210.

  32. [32]

    Deep Speech 2: End-to-end speech recognition in English and Mandarin

    D. Amodei et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in Int. Conf. Machine Learning (ICML), vol. 48, 2016, pp. 173–182.

  33. [33]

    wav2vec 2.0: A framework for self-supervised learning of speech representations

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Adv. in Neural Information Proc. Sys., H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33, 2020, pp. 12449–12460.

  34. [34]

    Robust speech recognition via large-scale weak supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Int. Conf. Machine Learning (ICML), vol. 202, 2023, pp. 28492–28518.

  35. [35]

    QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions

    S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, “QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2020, pp. 6124–6128.

  36. [36]

    Fast conformer with linearly scalable attention for efficient speech recognition

    D. Rekesh et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in IEEE Workshop Autom. Speech Recog. and Underst. (ASRU), 2023, pp. 1–8.

  37. [37]

    Efficient sequence transduction by jointly predicting tokens and durations

    H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in Int. Conf. Machine Learning (ICML), 2023.

  38. [38]

    Attention is all you need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Adv. in Neural Information Proc. Sys., vol. 30, 2017.

  39. [39]

    Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling

    S. Gandhi, P. von Platen, and A. M. Rush, “Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling,” arXiv preprint arXiv:2311.00430, 2023.

  40. [40]

    Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models

    R. Frieske and B. E. Shi, “Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models,” arXiv preprint arXiv:2401.01572, 2024.

  41. [41]

    Careless Whisper: Speech-to-text hallucination harms

    A. Koenecke, A. S. G. Choi, K. X. Mei, H. Schellmann, and M. Sloane, “Careless Whisper: Speech-to-text hallucination harms,” in Proc. ACM Conf. on Fairness, Accountability, and Transparency, 2024, pp. 1672–1681.

  42. [42]

    Investigation of Whisper ASR hallucinations induced by non-speech audio

    M. Barański, J. Jasiński, J. Bartolewska, S. Kacprzak, M. Witkowski, and K. Kowalczyk, “Investigation of Whisper ASR hallucinations induced by non-speech audio,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2025, pp. 1–5.

  43. [43]

    Investigating RNN-based speech enhancement methods for noise-robust text-to-speech

    C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust text-to-speech,” in 9th Speech Synthesis Workshop (SSW), 2016.

  44. [44]

    On the behavior of intrusive and non-intrusive speech enhancement metrics in predictive and generative settings

    D. de Oliveira, J. Richter, J.-M. Lemercier, T. Peer, and T. Gerkmann, “On the behavior of intrusive and non-intrusive speech enhancement metrics in predictive and generative settings,” in ITG-Fachtagung Sprachkommunikation, 2023, pp. 260–264.

  45. [45]

    Evaluation metrics for generative speech enhancement methods: Issues and perspectives

    J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah, S. Moeller, and T. Fingscheidt, “Evaluation metrics for generative speech enhancement methods: Issues and perspectives,” in ITG-Fachtagung Sprachkommunikation, 2023, pp. 265–269.

  46. [46]

    Speech enhancement and dereverberation with diffusion-based generative models

    J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 31, pp. 2351–2364, 2023.

  47. [47]

    Investigating training objectives for generative speech enhancement

    J. Richter, D. de Oliveira, and T. Gerkmann, “Investigating training objectives for generative speech enhancement,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2025, pp. 1–5.

  48. [48]

    Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration

    J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration,” in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2023, pp. 1–5.

  49. [49]

    StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation

    J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 31, pp. 2724–2737, 2023.

  50. [50]

    An investigation of incorporating Mamba for speech enhancement

    R. Chao, W.-H. Cheng, M. L. Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y. Tsao, “An investigation of incorporating Mamba for speech enhancement,” in IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 302–308.

  51. [51]

    Mamba: Linear-time sequence modeling with selective state spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” in First Conf. on Language Modeling, 2024.

  52. [52]

    MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra

    Y.-X. Lu, Y. Ai, and Z.-H. Ling, “MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra,” in ISCA Interspeech, 2023, pp. 3834–3838.

  53. [53]

    Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - temporal alignment

    J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, “Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - temporal alignment,” J. Audio Engineering Soc., vol. 61, no. 6, pp. 366–384, 2013.

  54. [54]

    Perceptual objective listening quality prediction

    “Perceptual objective listening quality prediction,” Rec. ITU-T P.863, International Telecommunications Union, Recommendation, 2018.

  55. [55]

    SCOREQ: Speech quality assessment with contrastive regression

    A. Ragano, J. Skoglund, and A. Hines, “SCOREQ: Speech quality assessment with contrastive regression,” in Adv. in Neural Information Proc. Sys., vol. 37, 2024, pp. 105702–105729.

  56. [56]

    Deep noise suppression maximizing non-differentiable PESQ mediated by a non-intrusive PESQNet

    Z. Xu, M. Strake, and T. Fingscheidt, “Deep noise suppression maximizing non-differentiable PESQ mediated by a non-intrusive PESQNet,” IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 30, pp. 1572–1585, 2022.

  57. [57]

    EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation

    J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation,” in ISCA Interspeech, 2024, pp. 4873–4877.

  58. [58]

    From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition

    A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in ISCA Interspeech, 2004, pp. 2765–2768.