pith. machine review for the scientific record.

arxiv: 2605.08608 · v1 · submitted 2026-05-09 · 📡 eess.AS

Recognition: 2 theorem links


Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

Hang Su, Jian Luan, Jing Lu, Junnan Wu, Lichun Fan, Tianyi Tan, Xiaobin Rong, Zhenbo Luo, Zheng Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 📡 eess.AS
keywords speech enhancement · language model · linguistic hallucination · noise-invariant · acoustic-semantic distillation · content faithfulness · autoregressive LM

The pith

A distillation framework creates noise-invariant representations that reduce linguistic hallucination in language-model speech enhancement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that jointly distilling acoustic reconstruction and semantic consistency targets from clean speech trains an encoder to produce conditioning signals that ignore noise, enabling an autoregressive language model to output faithful clean speech tokens instead of inventing incorrect linguistic content. A sympathetic reader would care because LM-based enhancers often produce audio that sounds natural yet conveys the wrong words or meaning when noise is strong, which can mislead downstream systems such as automatic speech recognition. The central idea is to force the encoder to preserve both the sound structure and the meaning of the original clean speech even when its input is heavily corrupted. This leads to measurable improvements in content faithfulness metrics, particularly when signal-to-noise ratios are low or reverberation is present.

Core claim

The central claim is that a noise-invariant acoustic-semantic conditioning encoder, obtained by distilling both reconstruction fidelity and linguistic consistency from clean speech, allows a decoder-only autoregressive language model to predict accurate clean acoustic tokens from noisy inputs, substantially reducing hallucination while a learnable WavLM-based codec ensures high perceptual quality.

What carries the argument

The noise-invariant acoustic-semantic conditioning encoder learned via joint distillation of acoustic and semantic clean-speech targets.
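To make that mechanism concrete, the sketch below shows one way such a joint distillation objective could be wired up. It is a minimal illustration under stated assumptions, not the authors' code: the encoder architecture, feature and target dimensions, the choice of L1 loss for the acoustic target and cosine distance for the semantic target, and the loss weights are all assumed, and random tensors stand in for real noisy features and frozen clean-speech teacher outputs.

```python
# Minimal sketch (not the paper's implementation) of joint acoustic-semantic
# distillation toward clean-speech targets. NIEncoder, dimensions, and loss
# forms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NIEncoder(nn.Module):
    """Toy conditioning encoder: one shared trunk, two projection heads."""
    def __init__(self, dim_in=80, dim_hid=256, dim_acoustic=768, dim_semantic=768):
        super().__init__()
        self.trunk = nn.GRU(dim_in, dim_hid, num_layers=2, batch_first=True)
        self.acoustic_head = nn.Linear(dim_hid, dim_acoustic)  # matches acoustic target dim
        self.semantic_head = nn.Linear(dim_hid, dim_semantic)  # matches semantic target dim

    def forward(self, noisy_feats):             # noisy_feats: (B, T, dim_in)
        h, _ = self.trunk(noisy_feats)          # (B, T, dim_hid)
        return self.acoustic_head(h), self.semantic_head(h)

def distill_loss(ac_pred, se_pred, ac_tgt, se_tgt, w_ac=1.0, w_se=1.0):
    """L1 toward the clean acoustic target plus cosine distance toward the clean
    semantic target; the specific loss forms and weights are assumptions."""
    l_ac = F.l1_loss(ac_pred, ac_tgt)
    l_se = 1.0 - F.cosine_similarity(se_pred, se_tgt, dim=-1).mean()
    return w_ac * l_ac + w_se * l_se

# Smoke test with random tensors standing in for noisy input features and
# frozen clean-speech teacher outputs (acoustic and semantic targets).
enc = NIEncoder()
noisy = torch.randn(4, 200, 80)       # e.g. log-mel frames of the noisy mixture
ac_tgt = torch.randn(4, 200, 768)     # clean-speech acoustic target
se_tgt = torch.randn(4, 200, 768)     # clean-speech semantic target
ac_pred, se_pred = enc(noisy)
loss = distill_loss(ac_pred, se_pred, ac_tgt, se_tgt)
loss.backward()
print(float(loss))
```

The design choice worth noting in this sketch is the shared trunk: because both heads read the same hidden states, the trunk is pushed toward representations that satisfy the acoustic and semantic targets simultaneously, which is the intuition behind the joint distillation.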

If this is right

  • The proposed method improves linguistic consistency metrics over prior LM-based speech enhancement approaches.
  • Performance gains are especially evident under low-SNR and reverberant conditions.
  • Perceptual quality stays competitive with baseline systems.
  • The framework supports high-quality generation through its WavLM-derived codec.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the distillation succeeds in creating invariant features, the same approach could help stabilize other generative models that rely on degraded conditioning signals.
  • Semantic distillation may prove more critical than acoustic alone for preventing content drift in audio generation tasks.
  • Applying the method to real recorded noise rather than simulated mixtures would test whether the invariance holds outside controlled experiments.

Load-bearing premise

That the joint distillation of acoustic and semantic targets from clean speech will result in conditioning representations that stay effective and noise-invariant when the input speech is severely degraded, stopping the language model from generating linguistically incorrect outputs.

What would settle it

Observing no reduction in linguistic hallucination rates, measured by content faithfulness metrics, when the method is evaluated on speech inputs with signal-to-noise ratios below 0 dB would falsify the claim that the distilled representations prevent unreliable conditioning.
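A concrete proxy for that test is sketched below: word error rate of an ASR transcript against the reference transcript, computed for both the noisy input and the enhanced output of a sub-0 dB mixture. The metric choice and the toy transcripts are illustrative assumptions; the paper's own content faithfulness metrics may differ, and in practice the transcripts would come from an ASR system run over a full low-SNR test set.

```python
# Minimal sketch of the falsification test: compare word error rate (a simple
# content-faithfulness proxy) for unprocessed vs. enhanced speech on sub-0 dB
# inputs. Transcripts here are placeholders, not outputs from the paper's system.
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical low-SNR case: the claim would be falsified if, aggregated over such
# cases, wer(reference, enhanced) showed no improvement over wer(reference, noisy).
reference = "turn left at the next junction"
noisy_asr = "turn best at the next function"       # ASR output on the noisy mixture
enhanced_asr = "turn left at the next junction"    # ASR output on the enhanced speech
print(wer(reference, noisy_asr), wer(reference, enhanced_asr))
```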

Figures

Figures reproduced from arXiv: 2605.08608 by Hang Su, Jian Luan, Jing Lu, Junnan Wu, Lichun Fan, Tianyi Tan, Xiaobin Rong, Zhenbo Luo, Zheng Wang.

Figure 1. Overview of L3-SE. NI-Encoder extracts acoustic and semantic representations from noisy speech, which are projected …
Figure 3. Acoustic-semantic joint distillation for learning …
Figure 4. (a) Acoustic layer weights for the teacher, single …
Figure 5. Different SNR distributions of DNS1 and our simulated sets. (a) DNS1, (b) general-SNR testset, (c) low-SNR testset.
Figure 6. Spectrogram-based qualitative comparison on a low-SNR utterance. Enhanced spectrograms and ASR transcripts, …
read the original abstract

Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/. The complete source code will be released after the manuscript is accepted.
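The abstract's "learnable weighted WavLM layer representations" are not specified further here. One common realization of such a scheme, assumed for illustration rather than confirmed by the paper, is a softmax-weighted sum over the frozen model's hidden layers (as used in SUPERB-style probing); the sketch below uses random tensors in place of actual WavLM layer outputs.

```python
# Minimal sketch of learnable weighted layer aggregation over a frozen SSL model's
# hidden states. The exact weighting scheme in the paper is not specified in the
# abstract, so this softmax-weighted sum is an assumption.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))   # learnable per-layer weights

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, B, T, D) hidden states from a frozen SSL model
        w = torch.softmax(self.logits, dim=0)                  # convex combination over layers
        return torch.einsum("l,lbtd->btd", w, layer_outputs)

num_layers, B, T, D = 13, 2, 100, 768                          # WavLM-Base-like shape
dummy_layers = torch.randn(num_layers, B, T, D)                # stand-in for real layer outputs
mix = WeightedLayerSum(num_layers)(dummy_layers)               # (B, T, D) fed to the codec
print(mix.shape)
```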

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes L3-SE, a noise-invariant acoustic-semantic distillation framework for LM-based speech enhancement. It trains a conditioning encoder on noisy inputs by jointly distilling acoustic reconstruction and semantic consistency targets from clean speech, then feeds the resulting representations into a decoder-only autoregressive LM that predicts clean acoustic tokens decoded via a high-fidelity WavLM-based codec. The central claim is that this yields reliable conditioning under adverse conditions, substantially reducing linguistic hallucination while improving content faithfulness, with experiments showing consistent gains on linguistic consistency metrics especially in low-SNR and reverberant settings.

Significance. If the distilled representations prove noise-invariant and causally reduce hallucinations, the work would address a key failure mode in LM-based SE, offering a principled way to improve linguistic reliability without sacrificing perceptual quality. The joint distillation approach and planned code release are positive contributions that could support reproducibility and further research in robust speech generation.

major comments (3)
  1. [§3] §3 (Method, conditioning encoder): The noise-invariance property is load-bearing for the central claim that reliable conditioning prevents linguistically incorrect tokens, yet the manuscript provides no direct diagnostics such as representation cosine similarity, Euclidean distances, or alignment metrics between clean and noisy versions of the same utterance across SNR levels or reverberation conditions.
  2. [§4] §4 (Experiments): Downstream gains on linguistic consistency metrics are reported, but without ablations that isolate the contribution of the noise-invariance mechanism (e.g., comparing against a non-distilled encoder or semantic-only distillation), it remains unclear whether observed improvements stem from the claimed invariance or from other training effects.
  3. [§4.2] §4.2 (Results under low-SNR): The abstract and results claim especially clear gains under low-SNR and reverberant conditions, but the absence of error bars, statistical significance tests, or detailed dataset/SNR breakdown tables makes it difficult to assess the robustness and magnitude of the reported outperformance.
minor comments (2)
  1. [§3.3] The high-fidelity codec description references learnable weighted WavLM layers but does not specify the exact weighting scheme or layer selection criterion in the main text.
  2. [Figure 1] Figure captions for the overall architecture could more explicitly label the distillation targets and the interface to the decoder-only LM.
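Major comment 1 asks for direct invariance diagnostics. A minimal sketch of one such diagnostic appears below: frame-wise cosine similarity between an encoder's embeddings of a clean utterance and of the same utterance remixed with noise at several SNRs. The toy encoder and synthetic signals are placeholders for the trained conditioning encoder and real clean/noisy pairs.

```python
# Minimal sketch of a noise-invariance diagnostic: cosine similarity between clean
# and noisy embeddings of the same utterance across SNR levels. `encoder` is a
# stand-in; in practice it would be the trained conditioning encoder.
import torch
import torch.nn.functional as F

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale the noise so the mixture has the requested SNR (dB) relative to clean."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def invariance_score(encoder, clean_wave, noise_wave, snr_db):
    """Mean frame-wise cosine similarity (1.0 = perfectly noise-invariant)."""
    emb_clean = encoder(clean_wave)
    emb_noisy = encoder(mix_at_snr(clean_wave, noise_wave, snr_db))
    return F.cosine_similarity(emb_clean, emb_noisy, dim=-1).mean().item()

# Toy "encoder": frame the waveform and project it; real use would swap in the NI-Encoder.
proj = torch.nn.Linear(160, 64)
encoder = lambda wave: proj(wave.reshape(-1, 160))

clean = torch.randn(16000)   # 1 s at 16 kHz, placeholder for a clean utterance
noise = torch.randn(16000)
for snr in (20, 10, 0, -5):
    print(snr, "dB ->", round(invariance_score(encoder, clean, noise, snr), 3))
```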

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional direct evidence and ablations would strengthen the manuscript's claims regarding noise-invariance and its causal role in reducing hallucinations. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (Method, conditioning encoder): The noise-invariance property is load-bearing for the central claim that reliable conditioning prevents linguistically incorrect tokens, yet the manuscript provides no direct diagnostics such as representation cosine similarity, Euclidean distances, or alignment metrics between clean and noisy versions of the same utterance across SNR levels or reverberation conditions.

    Authors: We agree that explicit diagnostics are valuable to directly substantiate the noise-invariance of the distilled representations. In the revised manuscript, we will add quantitative analyses including cosine similarity, Euclidean distances, and alignment metrics (e.g., CCA) computed between clean and noisy utterance pairs at multiple SNR levels and under reverberation. These will be presented in a new subsection or figure in §3 to complement the existing downstream results. revision: yes

  2. Referee: [§4] §4 (Experiments): Downstream gains on linguistic consistency metrics are reported, but without ablations that isolate the contribution of the noise-invariance mechanism (e.g., comparing against a non-distilled encoder or semantic-only distillation), it remains unclear whether observed improvements stem from the claimed invariance or from other training effects.

    Authors: We acknowledge that isolating the noise-invariance mechanism via targeted ablations would clarify its specific contribution. We will add these ablations in the revised §4, including variants with a non-distilled encoder (trained only on noisy inputs without clean targets) and semantic-only distillation (omitting the acoustic reconstruction loss). Results will be reported on the same linguistic consistency metrics to demonstrate the benefit of the joint acoustic-semantic approach. revision: yes

  3. Referee: [§4.2] §4.2 (Results under low-SNR): The abstract and results claim especially clear gains under low-SNR and reverberant conditions, but the absence of error bars, statistical significance tests, or detailed dataset/SNR breakdown tables makes it difficult to assess the robustness and magnitude of the reported outperformance.

    Authors: We agree that error bars, statistical tests, and finer-grained breakdowns are necessary to rigorously support the claims of outperformance in challenging conditions. In the revision, we will include standard deviation error bars (computed over multiple random seeds), paired statistical significance tests (e.g., t-tests or Wilcoxon), and expanded tables with per-SNR and per-dataset breakdowns for all key metrics in §4.2 and the supplementary material. revision: yes
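The paired significance testing promised in response 3 could be run as sketched below. The per-utterance scores are synthetic placeholders, and the metric, number of utterances, and one-sided alternative are assumptions rather than details taken from the paper.

```python
# Minimal sketch of a paired Wilcoxon signed-rank test over per-utterance metric
# scores for the proposed system vs. a baseline. Scores are synthetic placeholders.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_utts = 200
baseline_scores = rng.normal(loc=0.70, scale=0.10, size=n_utts)              # e.g. per-utterance consistency
proposed_scores = baseline_scores + rng.normal(loc=0.03, scale=0.05, size=n_utts)

# One-sided test: is the proposed system's score distribution shifted upward?
stat, p_value = wilcoxon(proposed_scores, baseline_scores, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, one-sided p={p_value:.4g}")
```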

Circularity Check

0 steps flagged

No circularity: new distillation framework with independent experimental validation

full rationale

The paper introduces L3-SE as a methodological proposal that jointly distills acoustic reconstruction and semantic consistency targets from clean speech to produce a conditioning encoder for a decoder-only LM. No equations, parameter fits, or derivations are described that would reduce the claimed noise-invariance or hallucination reduction to a quantity defined by the method's own inputs. The central claim rests on the architectural design and downstream empirical gains on linguistic consistency metrics under low-SNR conditions, without self-citations, ansatzes smuggled via prior work, or renaming of known results. The derivation chain is self-contained as an engineering contribution rather than a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that clean-speech acoustic and semantic targets can be jointly distilled to yield representations that remain invariant under severe noise; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption Joint distillation of acoustic reconstruction and semantic consistency targets from clean speech produces noise-invariant conditioning representations usable by an autoregressive LM
    Invoked as the core mechanism for reducing hallucination under adverse conditions.
invented entities (1)
  • L3-SE noise-invariant conditioning encoder no independent evidence
    purpose: To supply reliable conditioning to the LM decoder from noisy inputs
    Newly proposed component whose noise-invariance is asserted but not independently evidenced in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1480 out tokens · 58058 ms · 2026-05-12T01:06:16.356240+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
