Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement
Pith reviewed 2026-05-13 03:37 UTC · model grok-4.3
The pith
Modern ASR models trained on noisy data match human word error rates on enhanced speech more closely than older systems, yet their noise tolerance makes them less useful for judging acoustic enhancement quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A listening experiment establishes that modern automatic speech recognition systems trained on large-scale noisy data and equipped with language models correlate more closely with human word error rates on enhanced speech than simpler ASR setups, with a transducer model providing the most reliable transcriptions. At the same time, their robustness to noise and use of linguistic context render them uninformative for an acoustics-focused evaluation of enhancement performance.
What carries the argument
A listening experiment that gathers human transcriptions of enhanced speech signals and directly compares the resulting word error rates against those produced by multiple ASR architectures differing in training-data scale and internal language modeling.
If this is right
- Evaluations of speech enhancement can adopt modern ASR systems as a more human-aligned automatic metric than older recognizers.
- Transducer architectures become a practical choice for generating transcriptions in such evaluations.
- The choice of ASR system can change which enhancement algorithm appears superior, so results must be interpreted with the model's noise robustness in mind.
- Acoustic-specific instrumental metrics remain necessary when the goal is to isolate signal-level improvements rather than overall intelligibility.
Where Pith is reading between the lines
- Evaluation protocols could combine a modern ASR with an auxiliary acoustic probe that forces sensitivity to distortions the language model normally ignores.
- The same tension between robustness and task-specific sensitivity may appear when AI-based listeners are used to assess dereverberation or source separation.
- For pure acoustic ranking, one might deliberately limit context use in ASR or train variants focused only on signal fidelity rather than full intelligibility.
Load-bearing premise
The listening experiment with its chosen listeners, speech material, and enhancement conditions supplies a ground-truth measure of human recognition performance that holds beyond the tested conditions.
What would settle it
A new listening test with different speakers, noise types, or a different listener pool showing that modern ASR models no longer produce word error rates closer to humans' than simpler models do, or that the transducer model is no longer the most reliable.
Original abstract
Speech enhancement (SE) systems are typically evaluated using a variety of instrumental metrics. The use of automatic speech recognition (ASR) systems to evaluate SE performance is common in literature, usually in terms of word error rate (WER). However, WER scores depend heavily on the choice of ASR system and text normalization pipeline. In this paper, we investigate how modern ASR models correlate with human recognition of enhanced speech. A listening experiment reveals that modern ASR models with large-scale noisy training and embedded language models correlate more with human WER than simpler ones, with a transducer model providing the most reliable transcriptions. Nevertheless, we also show that these models' robustness to noise and use of context can be uninformative to an acoustics-focused evaluation of enhancement performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that modern ASR models trained on large-scale noisy data and incorporating language models correlate more strongly with human word error rates on enhanced speech than simpler ASR systems, with a transducer model providing the most reliable transcriptions. It further argues that the noise robustness and contextual capabilities of these advanced models can render them uninformative for purely acoustics-focused evaluation of speech enhancement performance.
Significance. If the correlation findings hold under scrutiny, the work offers a useful cautionary note for the speech enhancement community on the risks of using overly capable ASR systems as proxies for human perception, potentially leading to more careful selection of evaluation tools that better isolate acoustic improvements.
major comments (2)
- [Listening experiment and correlation results] The description of the listening experiment (referenced in the abstract and results) provides no information on the number of listeners, number of utterances, inter-listener agreement metrics, statistical tests for the reported correlations, error bars, or controls for speaker/condition variability. These details are load-bearing for the central claim that modern ASR models correlate better with human WER.
- [Results and discussion sections] The claim that the transducer model is 'the most reliable' and that robustness/context use can be 'uninformative' for acoustics-focused SE evaluation rests on the human WER ground truth; without reported sample sizes, significance testing, or analysis of generalization beyond the tested speaker pool and conditions, it is unclear whether the ranking and the follow-on warning would survive modest changes to the experimental setup.
minor comments (1)
- [Abstract] The abstract states the main findings but does not quantify the correlations or list the specific ASR models compared; adding one or two key numbers or model names would improve clarity without lengthening the text.
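To make the minor comment concrete, the correlation the referee asks to see quantified is typically a per-condition correlation between human and ASR word error rates. A minimal sketch, with invented placeholder WER values rather than the paper's data:

```python
# Illustrative sketch (NOT the paper's data): quantifying how closely an ASR
# model's WER tracks human WER across enhancement conditions.
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-condition WER values; placeholders only.
human_wer = [0.12, 0.30, 0.45, 0.08, 0.22]
asr_wer = [0.15, 0.28, 0.50, 0.10, 0.25]
print(round(pearson(human_wer, asr_wer), 3))
```

When only the induced ranking of enhancement systems matters, a rank correlation (Spearman or Kendall) over the same per-condition scores would be the natural alternative.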
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the experimental reporting and the robustness of our claims. We address each major comment below and have updated the manuscript to improve clarity and completeness.
Point-by-point responses
Referee: [Listening experiment and correlation results] The description of the listening experiment (referenced in the abstract and results) provides no information on the number of listeners, number of utterances, inter-listener agreement metrics, statistical tests for the reported correlations, error bars, or controls for speaker/condition variability. These details are load-bearing for the central claim that modern ASR models correlate better with human WER.
Authors: We agree that these methodological details are essential and should have been included. The original submission omitted a full description of the listening test protocol. In the revised manuscript we have added a new subsection under 'Experimental Setup' that reports the number of listeners, total utterances evaluated, inter-listener agreement (Fleiss' kappa), the statistical tests performed on the correlations (including p-values), error bars on all relevant figures, and the balancing procedures used to control for speaker and condition variability. These additions directly support the reported correlations without altering the underlying data or conclusions.
revision: yes
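The Fleiss' kappa statistic mentioned in this response can be sketched as follows, assuming listeners give categorical per-item judgments (e.g., a word heard correctly vs. not); the ratings matrix here is hypothetical, not the study's data:

```python
# Sketch of Fleiss' kappa for inter-listener agreement. Each row of 'ratings'
# is one item; columns are counts of listeners choosing each category
# (here: 3 listeners, two categories such as correct / incorrect).
# Hypothetical numbers only.
def fleiss_kappa(ratings):
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # assumes every item is rated by the same number of listeners
    # observed per-item agreement P_i, averaged into P-bar
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # chance agreement P_e from the category marginals
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

ratings = [[3, 0], [2, 1], [0, 3], [3, 0], [1, 2]]
print(fleiss_kappa(ratings))
```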
Referee: [Results and discussion sections] The claim that the transducer model is 'the most reliable' and that robustness/context use can be 'uninformative' for acoustics-focused SE evaluation rests on the human WER ground truth; without reported sample sizes, significance testing, or analysis of generalization beyond the tested speaker pool and conditions, it is unclear whether the ranking and the follow-on warning would survive modest changes to the experimental setup.
Authors: We acknowledge the need for explicit sample-size reporting and significance testing in the results. The revised version now states the exact number of utterances and listeners underlying each correlation, includes p-values for all reported correlations, and adds a limitations paragraph in the discussion that notes the scope of the speaker pool and conditions tested. We maintain that the transducer model's superior alignment with human WER is supported by the data we collected; however, we have softened the language around generalizability to reflect that broader validation would be valuable in future work. The core cautionary message about overly robust ASR systems remains unchanged because it follows directly from the observed behavior on the evaluated data.
revision: partial
Circularity Check
No circularity: purely empirical study with no derivations or self-referential reductions
full rationale
The paper is an empirical investigation that collects human listening data on enhanced speech, computes WER for various ASR models, and reports correlations between them. No equations, parameter fittings, uniqueness theorems, or ansatzes are invoked; the central claims rest on direct experimental measurements rather than any derivation chain that could reduce to its own inputs by construction. Self-citations, if present, are not load-bearing for the reported rankings or conclusions, and the work does not rename known results or smuggle assumptions via prior author work. The study is therefore self-contained against external benchmarks with no detectable circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human listeners provide an unbiased and generalizable measure of speech intelligibility under the test conditions used.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (relevance: unclear). Matched passage: "A listening experiment reveals that modern ASR models with large-scale noisy training and embedded language models correlate more with human WER than simpler ones, with a transducer model providing the most reliable transcriptions."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (relevance: unclear). Matched passage: "WER = (S + D + I) / (S + D + C) ... We employ several other SE metrics for comparison: POLQA, SCOREQ, ESTOI, LPS"
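The WER definition quoted in the matched passage, WER = (S + D + I) / (S + D + C), falls out of a word-level Levenshtein alignment between reference and hypothesis. A minimal sketch, together with the clipped WAcc = max(1 - WER, 0) aggregation the paper's Results excerpt describes:

```python
# WER = (S + D + I) / N, where N = S + D + C is the number of reference words.
# The total edit distance over words equals S + D + I, so a standard
# Levenshtein dynamic program suffices.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def wacc(reference: str, hypothesis: str) -> float:
    # Clipped word accuracy: WER can exceed 1 on heavy insertions,
    # so WAcc is floored at 0 before averaging across utterances.
    return max(1.0 - wer(reference, hypothesis), 0.0)

print(wer("the cat sat on the mat", "the cat sat on a hat"))  # 2 substitutions over 6 reference words
```

Note that WER is not symmetric in its arguments and is normalized by the reference length, which is why insertion-heavy hallucinations can push it above 1.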
Reference graph
Works this paper leans on
- [1] Introduction: "The task of speech enhancement (SE) generally aims at improving the quality and intelligibility of degraded speech recordings. There exists a multitude of instrumental metrics to evaluate SE performance without having to resort to human-generated scores, which are costly to collect, time- and effort-wise. As a way of avoiding biases towards..."
- [2] Method, 2.1. Word error rate: "WER is the most common metric to evaluate the performance of ASR systems. It is derived from the Levenshtein (edit) distance, which measures the difference between sequences [19]. More precisely, it measures the minimum number of edits required to change one sequence into another. WER is..." (margin stamp: arXiv:2605.12107v1 [eess.AS], 12 May 2026)
- [3] Results: "In order to facilitate the interpretation of correlations with other standard SE metrics where higher values are better, we compute the WAcc instead of the WER. Moreover, to mitigate the effect of catastrophic failures of Whisper while sticking to the typical aggregation procedure of averaging WER / WAcc scores, we clip WAcc by computing WAcc =..."
- [4] Conclusion: "ASR-based evaluation metrics are often used to evaluate SE systems. This paper analyzes the behavior of different ASR models regarding their transcription accuracy scores and subsequent ranking of SE models. Models trained with large-scale, noisy data are the ones that best match the trends in human transcription capabilities, although in o..."
- [5] Acknowledgments: "We would like to thank all the test subjects who agreed to take part in the listening experiment."
- [6] Generative AI Use Disclosure: "No AI writing tools were used for the writing of this paper."
- [7] D. de Oliveira, S. Welker, J. Richter, and T. Gerkmann, "The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement," in ISCA Interspeech, 2024, pp. 3854–3858.
- [8] H. Dubey et al., "ICASSP 2022 deep noise suppression challenge," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2022, pp. 9271–9275.
- [9] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2001.
- [10] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - Half-baked or well done?" in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2019, pp. 626–630.
- [11] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," in ISCA Interspeech, 2021, pp. 2127–2131.
- [12] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2022, pp. 886–890.
- [13] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, "UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022," in ISCA Interspeech, 2022, pp. 4521–4525.
- [14] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 19, no. 7, pp. 2125–2136, 2011.
- [15] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 24, no. 11, pp. 2009–2022, 2016.
- [16] M. Karbasi and D. Kolossa, "ASR-based speech intelligibility prediction: A review," Hearing Research, vol. 426, p. 108606, 2022.
- [17] M. R. Schädler, A. Warzybok, S. Hochmuth, and B. Kollmeier, "Matrix sentence intelligibility prediction using an automatic speech recognition system," Int. Journal of Audiology, vol. 54, no. sup2, pp. 100–107, 2015.
- [18] M. Karbasi, S. Bleeck, and D. Kolossa, "Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures," arXiv preprint arXiv:2010.08574, 2020.
- [19] K. Arai, S. Araki, A. Ogawa, K. Kinoshita, T. Nakatani, and T. Irino, "Predicting intelligibility of enhanced speech using posteriors derived from DNN-based ASR system," in ISCA Interspeech, 2020, pp. 1156–1160.
- [20] S. Siddiqui, G. Rasool, R. P. Ramachandran, and N. C. Bouaynaya, "Using deep speech recognition to evaluate speech enhancement methods," in Int. Joint Conf. on Neural Networks (IJCNN), 2020, pp. 1–7.
- [21] D. de Oliveira, T. Peer, J. Rochdi, and T. Gerkmann, "Are these even words? Quantifying the gibberishness of generative speech models," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2026, pp. 22472–22476.
- [22] K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, "How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR," in ISCA Interspeech, 2022, pp. 5418–5422.
- [23] S. Araki, A. Yamamoto, T. Ochiai, K. Arai, A. Ogawa, T. Nakatani, and T. Irino, "Impact of residual noise and artifacts in speech enhancement errors on intelligibility of human and machine," in ISCA Interspeech, 2023, pp. 2503–2507.
- [24] K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, "How does end-to-end speech recognition training impact speech enhancement artifacts?" in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2024, pp. 11031–11035.
- [25] V. I. Levenshtein et al., "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.
- [26] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Int. Conf. Machine Learning (ICML), 2006, pp. 369–376.
- [27] A. Graves, "Sequence transduction with recurrent neural networks," Int. Conf. Machine Learning (ICML) Workshop on Representation Learning, 2012.
- [28] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Adv. in Neural Information Proc. Sys., 2015, pp. 577–585.
- [29] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2016, pp. 4960–4964.
- [30] A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, "End-to-end automatic speech translation of audiobooks," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2018, pp. 6224–6228.
- [31] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2015, pp. 5206–5210.
- [32] D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Int. Conf. Machine Learning (ICML), vol. 48, 2016, pp. 173–182.
- [33] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Adv. in Neural Information Proc. Sys., vol. 33, 2020, pp. 12449–12460.
- [34] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Int. Conf. Machine Learning (ICML), vol. 202, 2023, pp. 28492–28518.
- [35] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, "QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2020, pp. 6124–6128.
- [36] D. Rekesh et al., "Fast conformer with linearly scalable attention for efficient speech recognition," in IEEE Workshop Autom. Speech Recog. and Underst. (ASRU), 2023, pp. 1–8.
- [37] H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, "Efficient sequence transduction by jointly predicting tokens and durations," in Int. Conf. Machine Learning (ICML), 2023.
- [38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Adv. in Neural Information Proc. Sys., vol. 30, 2017.
- [39] S. Gandhi, P. von Platen, and A. M. Rush, "Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling," arXiv preprint arXiv:2311.00430, 2023.
- [40] R. Frieske and B. E. Shi, "Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models," arXiv preprint arXiv:2401.01572, 2024.
- [41] A. Koenecke, A. S. G. Choi, K. X. Mei, H. Schellmann, and M. Sloane, "Careless Whisper: Speech-to-text hallucination harms," in Proc. ACM Conf. on Fairness, Accountability, and Transparency, 2024, pp. 1672–1681.
- [42] M. Barański, J. Jasiński, J. Bartolewska, S. Kacprzak, M. Witkowski, and K. Kowalczyk, "Investigation of Whisper ASR hallucinations induced by non-speech audio," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2025, pp. 1–5.
- [43] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in 9th Speech Synthesis Workshop (SSW), 2016.
- [44] D. de Oliveira, J. Richter, J.-M. Lemercier, T. Peer, and T. Gerkmann, "On the behavior of intrusive and non-intrusive speech enhancement metrics in predictive and generative settings," in ITG-Fachtagung Sprachkommunikation, 2023, pp. 260–264.
- [45] J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah, S. Moeller, and T. Fingscheidt, "Evaluation metrics for generative speech enhancement methods: Issues and perspectives," in ITG-Fachtagung Sprachkommunikation, 2023, pp. 265–269.
- [46] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 31, pp. 2351–2364, 2023.
- [47] J. Richter, D. de Oliveira, and T. Gerkmann, "Investigating training objectives for generative speech enhancement," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2025, pp. 1–5.
- [48] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration," in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), 2023, pp. 1–5.
- [49] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 31, pp. 2724–2737, 2023.
- [50] R. Chao, W.-H. Cheng, M. L. Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y. Tsao, "An investigation of incorporating Mamba for speech enhancement," in IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 302–308.
- [51] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," in First Conf. on Language Modeling, 2024.
- [52] Y.-X. Lu, Y. Ai, and Z.-H. Ling, "MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra," in ISCA Interspeech, 2023, pp. 3834–3838.
- [53] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - temporal alignment," J. Audio Engineering Soc., vol. 61, no. 6, pp. 366–384, 2013.
- [54] "Perceptual objective listening quality prediction," Rec. ITU-T P.863, International Telecommunications Union, Recommendation, 2018.
- [55] A. Ragano, J. Skoglund, and A. Hines, "SCOREQ: Speech quality assessment with contrastive regression," in Adv. in Neural Information Proc. Sys., vol. 37, 2024, pp. 105702–105729.
- [56] Z. Xu, M. Strake, and T. Fingscheidt, "Deep noise suppression maximizing non-differentiable PESQ mediated by a non-intrusive PESQNet," IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 30, pp. 1572–1585, 2022.
- [57] J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, "EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation," in ISCA Interspeech, 2024, pp. 4873–4877.
- [58] A. C. Morris, V. Maier, and P. Green, "From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition," in ISCA Interspeech, 2004, pp. 2765–2768.
discussion (0)