
arxiv: 2406.14294 · v4 · submitted 2024-06-20 · 💻 cs.SD · cs.AI · eess.AS

DASB - Discrete Audio and Speech Benchmark

Pith reviewed 2026-05-24 00:25 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords discrete audio tokens · speech benchmark · semantic tokens · acoustic tokens · multimodal models · audio processing · generative tasks · discriminative tasks

The pith

Discrete audio tokens are less robust than continuous features and need careful tuning of architecture, data size, learning rate, and capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Discrete Audio and Speech Benchmark to create consistent evaluation settings for discrete audio tokens across speech, general audio, and music. It tests both discriminative and generative tasks to compare how well these tokens preserve phonetic content, speaker identity, and other cues. Results indicate that semantic tokens generally perform better than acoustic tokens, yet both types remain less robust than continuous representations. The benchmark highlights that performance depends on factors such as model architecture and training settings, pointing to the need for further development to close the remaining gap.

Core claim

DASB provides a standardized framework for evaluating discrete audio tokens on a range of tasks in speech, audio, and music domains. The evaluation shows discrete representations are less robust than continuous ones and require careful tuning of model architecture, data size, learning rate, and capacity. Semantic tokens outperform acoustic tokens, but a performance gap to continuous features persists, indicating that further research is needed to make discrete tokens reliable for multimodal language models.

What carries the argument

The DASB benchmark framework, which applies consistent tasks and metrics across domains to compare discrete tokens against continuous features.

If this is right

  • Semantic tokens should be preferred over acoustic tokens for most speech and audio tasks when using discrete representations.
  • Model capacity, data volume, and learning rate must be tuned specifically for each discrete tokenizer to achieve reliable results.
  • A performance gap to continuous features will persist until new tokenizer designs or training methods are developed.
  • Inconsistent evaluation settings in prior work likely masked the full extent of robustness issues with discrete tokens.
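The per-tokenizer tuning described above can be pictured as a small grid search that is repeated for each tokenizer. The sketch below is only an illustration under stated assumptions: the search values, the `select_best_config` helper, and the `toy_evaluate` stand-in are all hypothetical and are not taken from the paper or from the DASB code.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not from the paper.
LEARNING_RATES = [1e-4, 5e-4, 1e-3]
HIDDEN_SIZES = [256, 512]


def select_best_config(evaluate, tokenizer,
                       learning_rates=LEARNING_RATES,
                       hidden_sizes=HIDDEN_SIZES):
    """Grid-search one tokenizer; lower score (e.g. an error rate) is better."""
    best_config, best_score = None, float("inf")
    for lr, hidden in product(learning_rates, hidden_sizes):
        score = evaluate(tokenizer, lr, hidden)
        if score < best_score:
            best_config = {"lr": lr, "hidden_size": hidden}
            best_score = score
    return best_config, best_score


def toy_evaluate(tokenizer, lr, hidden):
    # Stand-in for training a downstream model and returning validation error.
    # Each toy tokenizer has its own optimum to mimic tokenizer-specific tuning.
    target_lr = {"semantic": 5e-4, "acoustic": 1e-4}[tokenizer]
    return abs(lr - target_lr) * 1000 + 1.0 / hidden


if __name__ == "__main__":
    for name in ("semantic", "acoustic"):
        print(name, select_best_config(toy_evaluate, name))
```

In a real setup, `toy_evaluate` would be replaced by an actual training and validation run; the point is only that the best configuration is selected separately for each tokenizer rather than shared across them.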

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of multimodal language models may need to combine discrete and continuous representations to reach full performance on audio understanding and generation.
  • The benchmark setup could be extended to additional languages or noisy environments to test whether the observed gaps hold under more varied conditions.
  • Results may encourage work on hybrid tokenizers that aim to retain the efficiency of discrete units while approaching continuous feature robustness.

Load-bearing premise

The chosen tasks, domains, and metrics are representative enough to reveal general limitations of discrete tokens compared to continuous features.

What would settle it

A follow-up experiment that evaluates the same models on a new set of tasks outside the benchmark's covered domains and finds discrete tokens matching or exceeding continuous features on key metrics.

Figures

Figures reproduced from arXiv: 2406.14294 by Anastasia Kuznetsova, Artem Ploujnikov, Cem Subakan, Darius Petermann, Jarod Duret, Luca Della Libera, Mirco Ravanelli, Pooneh Mousavi.

Figure 1: The workflow of DASB consists of three steps. First, a discrete audio encoder converts the …
Figure 2: Time and memory required to process an utterance of 16 seconds for encoders and decoders …

Original abstract

Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the Discrete Audio and Speech Benchmark (DASB), a standardized framework for evaluating discrete audio tokens on discriminative and generative tasks spanning speech, general audio, and music domains. It reports empirical comparisons showing that discrete representations are less robust than continuous features, require careful tuning of model architecture, data size, learning rate, and capacity, that semantic tokens generally outperform acoustic tokens, and that a performance gap to continuous features persists; the benchmark, code, and leaderboards are released publicly.

Significance. If the controlled comparisons hold, DASB provides a valuable public resource for consistent evaluation of discrete audio tokens, addressing the inconsistency noted in prior work and highlighting practical challenges for multimodal models. The explicit controls for architecture, data size, learning rate, and capacity, together with public code and leaderboards, are strengths that support reproducibility and community follow-up.

minor comments (2)
  1. The abstract and benchmark description reference comparative findings; the manuscript should include explicit details on data splits, statistical tests, error bars, and hyperparameter search ranges in the experimental sections to allow full verification of the robustness claims.
  2. Task and domain selection (speech, audio, music) should include a short justification subsection explaining why the chosen metrics and tasks are expected to be representative for identifying general limitations of discrete versus continuous representations.
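As a hedged illustration of the error bars requested in minor comment 1, the sketch below summarizes a metric over several training runs with a mean, a sample standard deviation, and a normal-approximation 95% interval. The function name `summarize_runs` and the example scores are hypothetical; the paper's actual reporting protocol is not described in this review.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample std, half-width of a normal-approx. 95% interval)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = 1.96 * std / math.sqrt(len(scores))
    return mean, std, half_width


if __name__ == "__main__":
    # Made-up error rates from three seeds, for illustration only.
    mean, std, hw = summarize_runs([10.0, 12.0, 14.0])
    print(f"{mean:.2f} ± {hw:.2f} (std {std:.2f})")
```

With only a handful of seeds, a Student-t critical value would give a wider and more honest interval than the 1.96 used here; the normal approximation is kept only to keep the sketch dependency-free.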

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive and positive review, which recognizes the value of DASB as a standardized benchmark and the importance of the controlled comparisons. We are pleased with the recommendation for minor revision and will incorporate any minor suggestions in the revised version.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential predictions

full rationale

The paper presents DASB as an empirical evaluation framework for discrete audio tokens across tasks and domains, reporting comparative results on robustness, semantic vs. acoustic tokens, and gaps to continuous features. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; all claims rest on controlled experiments with public code. The work is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is the creation of an evaluation framework rather than any new derivation; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)
  • domain assumption: Existing discriminative and generative tasks in speech, audio, and music are appropriate proxies for measuring preservation of phonetic, speaker, and paralinguistic information.
    The abstract relies on these tasks without additional justification or validation of their representativeness.




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

    cs.CL 2025-09 unverdicted novelty 6.0

    StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

  2. On The Landscape of Spoken Language Models: A Comprehensive Survey

    cs.CL 2025-04 unverdicted novelty 3.0

    A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
