pith. machine review for the scientific record.

arxiv: 2605.08608 · v1 · submitted 2026-05-09 · 📡 eess.AS

Recognition: 2 theorem links


Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

Hang Su, Jian Luan, Jing Lu, Junnan Wu, Lichun Fan, Tianyi Tan, Xiaobin Rong, Zhenbo Luo, Zheng Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 📡 eess.AS
keywords speech enhancement · language model · linguistic hallucination · noise-invariant · acoustic-semantic distillation · content faithfulness · autoregressive LM

The pith

A distillation framework creates noise-invariant representations that reduce linguistic hallucination in language-model speech enhancement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that jointly distilling acoustic reconstruction and semantic consistency targets from clean speech trains an encoder to produce conditioning signals that ignore noise, enabling an autoregressive language model to output faithful clean speech tokens instead of inventing incorrect linguistic content. A sympathetic reader would care because LM-based enhancers often produce audio that sounds natural yet conveys the wrong words or meaning when noise is strong, which can mislead downstream systems such as automatic speech recognition. The central idea is to force the encoder to preserve both the sound structure and the meaning of the original clean speech even when its input is heavily corrupted. This leads to measurable improvements in content faithfulness metrics, particularly when signal-to-noise ratios are low or reverberation is present.

Core claim

The central claim is that a noise-invariant acoustic-semantic conditioning encoder, obtained by distilling both reconstruction fidelity and linguistic consistency from clean speech, allows a decoder-only autoregressive language model to predict accurate clean acoustic tokens from noisy inputs, substantially reducing hallucination while a learnable WavLM-based codec ensures high perceptual quality.

What carries the argument

The noise-invariant acoustic-semantic conditioning encoder learned via joint distillation of acoustic and semantic clean-speech targets.
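To make that mechanism concrete, the sketch below shows one way such a joint distillation objective could be wired up. It is a minimal illustration under stated assumptions, not the authors' code: the encoder architecture, feature and target dimensions, the choice of L1 loss for the acoustic target and cosine distance for the semantic target, and the loss weights are all assumed, and random tensors stand in for real noisy features and frozen clean-speech teacher outputs.

```python
# Minimal sketch (not the paper's implementation) of joint acoustic-semantic
# distillation toward clean-speech targets. NIEncoder, dimensions, and loss
# forms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NIEncoder(nn.Module):
    """Toy conditioning encoder: one shared trunk, two projection heads."""
    def __init__(self, dim_in=80, dim_hid=256, dim_acoustic=768, dim_semantic=768):
        super().__init__()
        self.trunk = nn.GRU(dim_in, dim_hid, num_layers=2, batch_first=True)
        self.acoustic_head = nn.Linear(dim_hid, dim_acoustic)  # matches acoustic target dim
        self.semantic_head = nn.Linear(dim_hid, dim_semantic)  # matches semantic target dim

    def forward(self, noisy_feats):             # noisy_feats: (B, T, dim_in)
        h, _ = self.trunk(noisy_feats)          # (B, T, dim_hid)
        return self.acoustic_head(h), self.semantic_head(h)

def distill_loss(ac_pred, se_pred, ac_tgt, se_tgt, w_ac=1.0, w_se=1.0):
    """L1 toward the clean acoustic target plus cosine distance toward the clean
    semantic target; the specific loss forms and weights are assumptions."""
    l_ac = F.l1_loss(ac_pred, ac_tgt)
    l_se = 1.0 - F.cosine_similarity(se_pred, se_tgt, dim=-1).mean()
    return w_ac * l_ac + w_se * l_se

# Smoke test with random tensors standing in for noisy input features and
# frozen clean-speech teacher outputs (acoustic and semantic targets).
enc = NIEncoder()
noisy = torch.randn(4, 200, 80)       # e.g. log-mel frames of the noisy mixture
ac_tgt = torch.randn(4, 200, 768)     # clean-speech acoustic target
se_tgt = torch.randn(4, 200, 768)     # clean-speech semantic target
ac_pred, se_pred = enc(noisy)
loss = distill_loss(ac_pred, se_pred, ac_tgt, se_tgt)
loss.backward()
print(float(loss))
```

The design choice worth noting in this sketch is the shared trunk: because both heads read the same hidden states, the trunk is pushed toward representations that satisfy the acoustic and semantic targets simultaneously, which is the intuition behind the joint distillation.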

If this is right

  • The proposed method improves linguistic consistency metrics over prior LM-based speech enhancement approaches.
  • Performance gains are especially evident under low-SNR and reverberant conditions.
  • Perceptual quality stays competitive with baseline systems.
  • The framework supports high-quality generation through its WavLM-derived codec.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the distillation succeeds in creating invariant features, the same approach could help stabilize other generative models that rely on degraded conditioning signals.
  • Semantic distillation may prove more critical than acoustic alone for preventing content drift in audio generation tasks.
  • Applying the method to real recorded noise rather than simulated mixtures would test whether the invariance holds outside controlled experiments.

Load-bearing premise

That the joint distillation of acoustic and semantic targets from clean speech will result in conditioning representations that stay effective and noise-invariant when the input speech is severely degraded, stopping the language model from generating linguistically incorrect outputs.

What would settle it

Observing no reduction in linguistic hallucination rates, measured by content faithfulness metrics, when the method is evaluated on speech inputs with signal-to-noise ratios below 0 dB would falsify the claim that the distilled representations prevent unreliable conditioning.
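A concrete proxy for that test is sketched below: word error rate of an ASR transcript against the reference transcript, computed for both the noisy input and the enhanced output of a sub-0 dB mixture. The metric choice and the toy transcripts are illustrative assumptions; the paper's own content faithfulness metrics may differ, and in practice the transcripts would come from an ASR system run over a full low-SNR test set.

```python
# Minimal sketch of the falsification test: compare word error rate (a simple
# content-faithfulness proxy) for unprocessed vs. enhanced speech on sub-0 dB
# inputs. Transcripts here are placeholders, not outputs from the paper's system.
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical low-SNR case: the claim would be falsified if, aggregated over such
# cases, wer(reference, enhanced) showed no improvement over wer(reference, noisy).
reference = "turn left at the next junction"
noisy_asr = "turn best at the next function"       # ASR output on the noisy mixture
enhanced_asr = "turn left at the next junction"    # ASR output on the enhanced speech
print(wer(reference, noisy_asr), wer(reference, enhanced_asr))
```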

Figures

Figures reproduced from arXiv: 2605.08608 by Hang Su, Jian Luan, Jing Lu, Junnan Wu, Lichun Fan, Tianyi Tan, Xiaobin Rong, Zhenbo Luo, Zheng Wang.

Figure 1. Overview of L3-SE. NI-Encoder extracts acoustic and semantic representations from noisy speech, which are projected …
Figure 3. Acoustic-semantic joint distillation for learning …
Figure 4. (a) Acoustic layer weights for the teacher, single …
Figure 5. Different SNR distributions of DNS1 and our simulated sets. (a) DNS1, (b) general-SNR testset, (c) low-SNR testset.
Figure 6. Spectrogram-based qualitative comparison on a low-SNR utterance. Enhanced spectrograms and ASR transcripts, …
read the original abstract

Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/. The complete source code will be released after the manuscript is accepted.
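The abstract's "learnable weighted WavLM layer representations" are not specified further here. One common realization of such a scheme, assumed for illustration rather than confirmed by the paper, is a softmax-weighted sum over the frozen model's hidden layers (as used in SUPERB-style probing); the sketch below uses random tensors in place of actual WavLM layer outputs.

```python
# Minimal sketch of learnable weighted layer aggregation over a frozen SSL model's
# hidden states. The exact weighting scheme in the paper is not specified in the
# abstract, so this softmax-weighted sum is an assumption.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))   # learnable per-layer weights

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, B, T, D) hidden states from a frozen SSL model
        w = torch.softmax(self.logits, dim=0)                  # convex combination over layers
        return torch.einsum("l,lbtd->btd", w, layer_outputs)

num_layers, B, T, D = 13, 2, 100, 768                          # WavLM-Base-like shape
dummy_layers = torch.randn(num_layers, B, T, D)                # stand-in for real layer outputs
mix = WeightedLayerSum(num_layers)(dummy_layers)               # (B, T, D) fed to the codec
print(mix.shape)
```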

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes L3-SE, a noise-invariant acoustic-semantic distillation framework for LM-based speech enhancement. It trains a conditioning encoder on noisy inputs by jointly distilling acoustic reconstruction and semantic consistency targets from clean speech, then feeds the resulting representations into a decoder-only autoregressive LM that predicts clean acoustic tokens decoded via a high-fidelity WavLM-based codec. The central claim is that this yields reliable conditioning under adverse conditions, substantially reducing linguistic hallucination while improving content faithfulness, with experiments showing consistent gains on linguistic consistency metrics especially in low-SNR and reverberant settings.

Significance. If the distilled representations prove noise-invariant and causally reduce hallucinations, the work would address a key failure mode in LM-based SE, offering a principled way to improve linguistic reliability without sacrificing perceptual quality. The joint distillation approach and planned code release are positive contributions that could support reproducibility and further research in robust speech generation.

major comments (3)
  1. [§3] §3 (Method, conditioning encoder): The noise-invariance property is load-bearing for the central claim that reliable conditioning prevents linguistically incorrect tokens, yet the manuscript provides no direct diagnostics such as representation cosine similarity, Euclidean distances, or alignment metrics between clean and noisy versions of the same utterance across SNR levels or reverberation conditions.
  2. [§4] §4 (Experiments): Downstream gains on linguistic consistency metrics are reported, but without ablations that isolate the contribution of the noise-invariance mechanism (e.g., comparing against a non-distilled encoder or semantic-only distillation), it remains unclear whether observed improvements stem from the claimed invariance or from other training effects.
  3. [§4.2] §4.2 (Results under low-SNR): The abstract and results claim especially clear gains under low-SNR and reverberant conditions, but the absence of error bars, statistical significance tests, or detailed dataset/SNR breakdown tables makes it difficult to assess the robustness and magnitude of the reported outperformance.
minor comments (2)
  1. [§3.3] The high-fidelity codec description references learnable weighted WavLM layers but does not specify the exact weighting scheme or layer selection criterion in the main text.
  2. [Figure 1] Figure captions for the overall architecture could more explicitly label the distillation targets and the interface to the decoder-only LM.
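Major comment 1 asks for direct invariance diagnostics. A minimal sketch of one such diagnostic appears below: frame-wise cosine similarity between an encoder's embeddings of a clean utterance and of the same utterance remixed with noise at several SNRs. The toy encoder and synthetic signals are placeholders for the trained conditioning encoder and real clean/noisy pairs.

```python
# Minimal sketch of a noise-invariance diagnostic: cosine similarity between clean
# and noisy embeddings of the same utterance across SNR levels. `encoder` is a
# stand-in; in practice it would be the trained conditioning encoder.
import torch
import torch.nn.functional as F

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale the noise so the mixture has the requested SNR (dB) relative to clean."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def invariance_score(encoder, clean_wave, noise_wave, snr_db):
    """Mean frame-wise cosine similarity (1.0 = perfectly noise-invariant)."""
    emb_clean = encoder(clean_wave)
    emb_noisy = encoder(mix_at_snr(clean_wave, noise_wave, snr_db))
    return F.cosine_similarity(emb_clean, emb_noisy, dim=-1).mean().item()

# Toy "encoder": frame the waveform and project it; real use would swap in the NI-Encoder.
proj = torch.nn.Linear(160, 64)
encoder = lambda wave: proj(wave.reshape(-1, 160))

clean = torch.randn(16000)   # 1 s at 16 kHz, placeholder for a clean utterance
noise = torch.randn(16000)
for snr in (20, 10, 0, -5):
    print(snr, "dB ->", round(invariance_score(encoder, clean, noise, snr), 3))
```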

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional direct evidence and ablations would strengthen the manuscript's claims regarding noise-invariance and its causal role in reducing hallucinations. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (Method, conditioning encoder): The noise-invariance property is load-bearing for the central claim that reliable conditioning prevents linguistically incorrect tokens, yet the manuscript provides no direct diagnostics such as representation cosine similarity, Euclidean distances, or alignment metrics between clean and noisy versions of the same utterance across SNR levels or reverberation conditions.

    Authors: We agree that explicit diagnostics are valuable to directly substantiate the noise-invariance of the distilled representations. In the revised manuscript, we will add quantitative analyses including cosine similarity, Euclidean distances, and alignment metrics (e.g., CCA) computed between clean and noisy utterance pairs at multiple SNR levels and under reverberation. These will be presented in a new subsection or figure in §3 to complement the existing downstream results. revision: yes

  2. Referee: [§4] §4 (Experiments): Downstream gains on linguistic consistency metrics are reported, but without ablations that isolate the contribution of the noise-invariance mechanism (e.g., comparing against a non-distilled encoder or semantic-only distillation), it remains unclear whether observed improvements stem from the claimed invariance or from other training effects.

    Authors: We acknowledge that isolating the noise-invariance mechanism via targeted ablations would clarify its specific contribution. We will add these ablations in the revised §4, including variants with a non-distilled encoder (trained only on noisy inputs without clean targets) and semantic-only distillation (omitting the acoustic reconstruction loss). Results will be reported on the same linguistic consistency metrics to demonstrate the benefit of the joint acoustic-semantic approach. revision: yes

  3. Referee: [§4.2] §4.2 (Results under low-SNR): The abstract and results claim especially clear gains under low-SNR and reverberant conditions, but the absence of error bars, statistical significance tests, or detailed dataset/SNR breakdown tables makes it difficult to assess the robustness and magnitude of the reported outperformance.

    Authors: We agree that error bars, statistical tests, and finer-grained breakdowns are necessary to rigorously support the claims of outperformance in challenging conditions. In the revision, we will include standard deviation error bars (computed over multiple random seeds), paired statistical significance tests (e.g., t-tests or Wilcoxon), and expanded tables with per-SNR and per-dataset breakdowns for all key metrics in §4.2 and the supplementary material. revision: yes
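The paired significance testing promised in response 3 could be run as sketched below. The per-utterance scores are synthetic placeholders, and the metric, number of utterances, and one-sided alternative are assumptions rather than details taken from the paper.

```python
# Minimal sketch of a paired Wilcoxon signed-rank test over per-utterance metric
# scores for the proposed system vs. a baseline. Scores are synthetic placeholders.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_utts = 200
baseline_scores = rng.normal(loc=0.70, scale=0.10, size=n_utts)              # e.g. per-utterance consistency
proposed_scores = baseline_scores + rng.normal(loc=0.03, scale=0.05, size=n_utts)

# One-sided test: is the proposed system's score distribution shifted upward?
stat, p_value = wilcoxon(proposed_scores, baseline_scores, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, one-sided p={p_value:.4g}")
```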

Circularity Check

0 steps flagged

No circularity: new distillation framework with independent experimental validation

full rationale

The paper introduces L3-SE as a methodological proposal that jointly distills acoustic reconstruction and semantic consistency targets from clean speech to produce a conditioning encoder for a decoder-only LM. No equations, parameter fits, or derivations are described that would reduce the claimed noise-invariance or hallucination reduction to a quantity defined by the method's own inputs. The central claim rests on the architectural design and downstream empirical gains on linguistic consistency metrics under low-SNR conditions, without self-citations, ansatzes smuggled via prior work, or renaming of known results. The derivation chain is self-contained as an engineering contribution rather than a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that clean-speech acoustic and semantic targets can be jointly distilled to yield representations that remain invariant under severe noise; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption Joint distillation of acoustic reconstruction and semantic consistency targets from clean speech produces noise-invariant conditioning representations usable by an autoregressive LM
    Invoked as the core mechanism for reducing hallucination under adverse conditions.
invented entities (1)
  • L3-SE noise-invariant conditioning encoder no independent evidence
    purpose: To supply reliable conditioning to the LM decoder from noisy inputs
    Newly proposed component whose noise-invariance is asserted but not independently evidenced in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1480 out tokens · 58058 ms · 2026-05-12T01:06:16.356240+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
