
arxiv: 2605.19541 · v1 · pith:326AKCKYnew · submitted 2026-05-19 · 💻 cs.SD

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

Pith reviewed 2026-05-20 02:04 UTC · model grok-4.3

classification 💻 cs.SD
keywords neural speech codec · reinforcement learning · low bitrate · word error rate · speech intelligibility · 300 bps · LibriSpeech

The pith

A 300-bps neural speech codec reaches 3.55 percent word error rate on clean audio after reinforcement learning fine-tunes its encoder for intelligibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ClariCodec, a neural speech codec built for ultra-low-bitrate channels such as satellite links where intelligibility matters more than acoustic fidelity. It reframes the quantization process as a stochastic policy so that reinforcement learning can adjust the encoder using rewards based on word error rate from an automatic recognizer. The rest of the reconstruction pipeline stays frozen throughout this adjustment. The base model already produces competitive error rates at 300 bps, and the reinforcement learning step delivers a further 23 percent relative reduction in word error rate on LibriSpeech test sets while leaving perceptual quality unchanged. This matters because conventional training at extreme compression wastes bits on details that do not help listeners understand spoken words.

Core claim

ClariCodec reformulates quantisation as a stochastic policy, so that reinforcement learning can fine-tune only the encoder with word-error-rate rewards while the acoustic reconstruction pipeline remains frozen. On the LibriSpeech test-clean set this yields 4.64 percent WER without reinforcement learning and 3.55 percent WER with it. The paper reports the gain as a 23 percent relative reduction, covering both the test-clean and test-other sets.
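
The relative figure can be checked directly from the two test-clean numbers quoted here. The short sketch below does only that arithmetic; the helper name `relative_reduction` is ours, not the paper's.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of `after` with respect to `before`, as a fraction."""
    return (before - after) / before

# WER figures (percent) quoted on this page for LibriSpeech test-clean.
wer_base, wer_rl = 4.64, 3.55

absolute_drop = wer_base - wer_rl                      # percentage points
relative_drop = relative_reduction(wer_base, wer_rl)   # fraction of the baseline

print(f"absolute: {absolute_drop:.2f} points, relative: {relative_drop:.1%}")
# prints: absolute: 1.09 points, relative: 23.5%
```

The test-clean pair therefore rounds to the quoted 23 percent. The pre-RL test-other figure is not given on this page, so the same check cannot be repeated for that set here.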

What carries the argument

Reformulation of quantization as a stochastic policy that permits reinforcement learning to optimize the encoder using word error rate rewards while the decoder pipeline stays fixed.
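
To make the idea concrete, here is a minimal sketch of treating a quantizer as a stochastic policy. It is a toy REINFORCE update on a single frame with four codebook entries, not the authors' algorithm: the codebook size, the `toy_reward` table (a stand-in for the negative WER an ASR system would return after the frozen decoder reconstructs the audio), the learning rate, and the running-mean baseline are all assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_reward(code):
    # Hypothetical stand-in for "negative WER after decoding with the frozen
    # pipeline". Code 2 is, by construction, the most intelligible choice.
    return {0: -0.30, 1: -0.20, 2: -0.05, 3: -0.40}[code]

def reinforce(logits, steps=2000, lr=0.5, seed=0):
    """Update encoder logits so that high-reward codes become more likely."""
    rng = random.Random(seed)
    logits = list(logits)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Quantisation as a policy: sample a code instead of taking the argmax.
        code = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        reward = toy_reward(code)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)  # running-mean baseline
        # Score function: d/d logit_k log pi(code) = 1[k == code] - probs[k]
        for k in range(len(logits)):
            grad = (1.0 if k == code else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)

final_probs = reinforce([0.0, 0.0, 0.0, 0.0])
print([round(p, 3) for p in final_probs])
```

Only the logits, which play the role of the encoder output, are updated. The reward function is treated as a black box, which is what allows a non-differentiable metric such as WER to drive the optimisation while everything downstream of the quantizer stays fixed.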

If this is right

  • Intelligibility-focused optimization at 300 bps becomes practical without retraining the full codec.
  • The same encoder-only reinforcement learning adjustment applies to other constrained channels such as underwater links.
  • Perceptual quality metrics stay stable even when training targets word accuracy instead of acoustic detail.
  • Competitive performance is possible at bitrates far below those of conventional neural codecs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar reinforcement learning on intelligibility rewards could be tested on other low-bitrate audio or multimodal signals.
  • Real-time channel adaptation might combine this encoder tuning with dynamic bitrate allocation.
  • The approach invites direct comparison against end-to-end trainable codecs that do not freeze the decoder.

Load-bearing premise

Word error rate measured by an automatic recognizer accurately reflects human intelligibility, and adjusting only the encoder will not introduce inconsistencies or artifacts when the frozen reconstruction pipeline is used for actual transmission.
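
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below implements that standard definition; it says nothing about which recognizer or text normalisation the paper used, and the function name is ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the score depends entirely on the recognizer's transcript, any systematic bias in that recognizer carries straight into the reward, which is why the premise above is load-bearing.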

What would settle it

Human listeners transcribing speech from the reinforcement-learning version produce higher word error rates or report more artifacts than from the non-reinforced version in a controlled blind test.

Figures

Figures reproduced from arXiv: 2605.19541 by Chao Zhang, Chi Zhang, Haifeng Luo, Hao Wang, Jing Qian, Junyi Wang, Zengrui Jin.

Figure 1: Overview of the two-stage training framework of ClariCodec. In Stage 1, the full codec is trained end-to-end using a combination of L1 mel reconstruction loss, adversarial loss, and feature matching loss to ensure high-fidelity speech reconstruction. In Stage 2, all modules except the encoder are frozen, and the encoder is fine-tuned using an RL objective where the reward signal is derived from a pretraine…

Original abstract

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ClariCodec, a neural speech codec for 300 bps operation in bandwidth-constrained settings. It reformulates quantization as a stochastic policy to permit reinforcement learning fine-tuning of the encoder using an external WER reward, while the acoustic reconstruction pipeline remains frozen. The authors report that the base model achieves 4.64% WER on LibriSpeech test-clean; after RL, this improves to 3.55% on test-clean and 10.4% on test-other (23% relative reduction) while preserving perceptual quality.

Significance. If the experimental claims are substantiated, the work would be significant for ultra-low-bitrate speech communication in satellite and underwater channels, where intelligibility is paramount. The RL-driven optimization of an intelligibility proxy without retraining the full decoder is a targeted and potentially efficient technique; the competitive WER at 300 bps even before RL is noteworthy and could influence future codec design.

major comments (2)
  1. [Abstract / Results] Abstract and results: The headline 23% relative WER reduction (4.64% to 3.55% on test-clean) is presented without any description of the ASR system used to compute the reward, the number of RL training runs, statistical significance tests, or ablation studies isolating the RL contribution. Because the reward is an external metric, these omissions make it impossible to verify that the reported gain is robust rather than an artifact of a particular recognizer or training seed.
  2. [Abstract] Abstract: The claim that perceptual quality is preserved after encoder-only RL fine-tuning is load-bearing for the overall contribution, yet no supporting evidence (human listening tests, PESQ/STOI scores, or analysis of decoder outputs on the shifted latent distribution) is supplied. Without such checks, it remains possible that the WER improvement trades off against new reconstruction artifacts once the frozen decoder is used in transmission.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the underlying neural codec architecture (e.g., number of layers, latent dimension) and the precise bit allocation at 300 bps to aid reproducibility.
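
As background for the bit-allocation request, the bitrate of a discrete codec is the frame rate multiplied by the bits spent per frame, where each codebook of size N contributes log2(N) bits. The sketch below shows two configurations that land on 300 bps. Both are hypothetical and chosen only to illustrate the arithmetic; neither is taken from the paper.

```python
import math

def bitrate_bps(frame_rate_hz: float, codebook_sizes: list[int]) -> float:
    """Bitrate = frames per second x total bits per frame."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame

# Hypothetical configurations (assumptions, not the paper's design).
print(bitrate_bps(25.0, [4096]))        # 25 frames/s x 12 bits -> 300.0
print(bitrate_bps(12.5, [4096, 4096]))  # 12.5 frames/s x 24 bits -> 300.0
```

Since quite different frame rates and codebook layouts reach the same 300 bps, stating the actual allocation would remove an ambiguity that the headline bitrate alone leaves open.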

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and indicate the revisions planned to improve clarity and substantiation of the results.

  1. Referee: [Abstract / Results] Abstract and results: The headline 23% relative WER reduction (4.64% to 3.55% on test-clean) is presented without any description of the ASR system used to compute the reward, the number of RL training runs, statistical significance tests, or ablation studies isolating the RL contribution. Because the reward is an external metric, these omissions make it impossible to verify that the reported gain is robust rather than an artifact of a particular recognizer or training seed.

    Authors: We acknowledge that additional experimental details are needed to demonstrate robustness. In the revised manuscript we will expand the methods and results sections to describe the ASR system used for the WER reward, specify the number of RL training runs performed, report statistical significance (including standard deviations across runs and appropriate hypothesis tests), and include ablation studies that isolate the RL fine-tuning contribution from the base encoder. These additions will enable readers to assess whether the observed WER improvement is reliable.
    revision: yes

  2. Referee: [Abstract] Abstract: The claim that perceptual quality is preserved after encoder-only RL fine-tuning is load-bearing for the overall contribution, yet no supporting evidence (human listening tests, PESQ/STOI scores, or analysis of decoder outputs on the shifted latent distribution) is supplied. Without such checks, it remains possible that the WER improvement trades off against new reconstruction artifacts once the frozen decoder is used in transmission.

    Authors: We agree that direct evidence is required to support the preservation of perceptual quality. Although the decoder remains frozen, we will add objective metrics (PESQ and STOI) computed on reconstructions from both the base and RL-tuned encoders. We will also include an analysis of decoder outputs on the shifted latent distribution to check for introduced artifacts. Where space permits, we will incorporate limited subjective listening test results. These revisions will directly address the possibility of quality trade-offs.
    revision: yes
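
One common way to carry out the significance testing discussed in the first response is a paired bootstrap over utterances. The sketch below is an illustration of that general technique under stated assumptions, not a description of what the authors did or plan to do: the per-utterance error counts are synthetic, and the function names are ours.

```python
import random

def corpus_wer(errors, ref_lengths):
    """Corpus-level WER: total word errors over total reference words."""
    return sum(errors) / sum(ref_lengths)

def paired_bootstrap(errors_a, errors_b, ref_lengths, n_resamples=2000, seed=0):
    """Fraction of resamples in which system B is NOT better than system A."""
    rng = random.Random(seed)
    n = len(ref_lengths)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, using the same indices for
        # both systems so that the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        lengths = [ref_lengths[i] for i in idx]
        wer_a = corpus_wer([errors_a[i] for i in idx], lengths)
        wer_b = corpus_wer([errors_b[i] for i in idx], lengths)
        if wer_b >= wer_a:
            not_better += 1
    return not_better / n_resamples

# Synthetic per-utterance data (an assumption for illustration only):
# system A is the base codec, system B the RL-tuned one.
data_rng = random.Random(1)
ref_lengths = [20] * 200
errors_a = [data_rng.randint(0, 3) for _ in ref_lengths]
errors_b = [max(0, e - 1) if data_rng.random() < 0.4 else e for e in errors_a]

p_value = paired_bootstrap(errors_a, errors_b, ref_lengths)
print(f"approximate one-sided p-value: {p_value:.4f}")
```

A test of this kind addresses sampling variability over the evaluation set only. Variability across RL training seeds, which the referee also raises, would need repeated training runs.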

Circularity Check

0 steps flagged

No significant circularity; external WER reward keeps derivation independent

The paper reformulates quantization as a stochastic policy and applies RL fine-tuning to the encoder using WER from a separate ASR as the reward signal, with the reconstruction pipeline held frozen. The reported WER drop on test-clean and test-other is the measured outcome of that optimization on held-out data rather than a quantity that reduces by the paper's own equations to a fitted parameter or self-defined input. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the abstract or described method. The central result therefore remains statistically independent of its inputs and receives a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard neural codec architectures and RL training assumptions without introducing new physical entities or circular derivations.

free parameters (1)
  • WER reward scaling factor
    The relative weight given to the word-error-rate reward during RL fine-tuning is a hyperparameter that must be chosen or tuned on held-out data.
axioms (1)
  • domain assumption: Word error rate from an automatic speech recognizer serves as a reliable proxy for human intelligibility in the target use cases.
    Invoked when the encoder is fine-tuned using WER-driven rewards.

pith-pipeline@v0.9.0 · 5719 in / 1305 out tokens · 55373 ms · 2026-05-20T02:04:06.814989+00:00 · methodology

