CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

Eugene Kwek; Feng Liu; Rui Zhang; Wenpeng Yin

arXiv: 2606.04418 · v1 · pith:B4QKLSTUnew · submitted 2026-06-03 · 💻 cs.SD · cs.CL · eess.AS

Pith reviewed 2026-06-28 05:18 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · eess.AS

keywords speech tokenization · neural audio codecs · denoising · information bottleneck · text-to-speech · voice conversion · perceptual encoding · audio compression

The pith

CleanCodec reframes speech tokenization as a selective information bottleneck to discard noise and achieve 12.5 tokens per second with better quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CleanCodec as a denoising neural audio codec that learns to retain only perceptually important features while discarding background noise and artifacts. It does this by treating tokenization as a selective information bottleneck rather than trying to reconstruct every detail of the input signal. The approach is evaluated at a low rate of 12.5 tokens per second and shows gains in speaker similarity and speech intelligibility over prior codecs. Downstream tests on text-to-speech and voice conversion confirm better results along with substantially faster inference. The core idea matters because existing codecs often waste tokens on imperceptible details that do not help later modeling steps.

Core claim

CleanCodec is a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, it achieves state-of-the-art tokenization efficiency and substantially outperforms existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference.

What carries the argument

The selective information bottleneck that forces the codec to discard imperceptible information such as background noise and recording artifacts.
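The abstract gives no loss terms, so the following is only a generic denoising-autoencoder objective with a quantization bottleneck, written out to make the idea concrete. Every symbol is an assumption introduced here for illustration: $x$ is clean speech, $n$ is additive noise, $E_\theta$ is an encoder, $Q$ is a quantizer that limits the token rate, $D_\phi$ is a decoder, and $d$ is some perceptual distance. It is not the paper's stated loss.

```latex
\min_{\theta,\,\phi}\;
\mathbb{E}_{x,\,n}\!\left[
  d\!\left( D_\phi\!\big( Q( E_\theta(x + n) ) \big),\; x \right)
\right]
```

Under this kind of objective the target is the clean signal $x$ rather than the noisy input $x + n$, so a capacity-limited $Q$ has no incentive to spend tokens on $n$.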

If this is right

Reconstructed speech exhibits higher speaker similarity and intelligibility at low token rates.
Downstream text-to-speech and voice conversion models achieve better output quality.
Inference speed in downstream pipelines increases by up to 17 times.
Speech processing pipelines require fewer tokens overall while maintaining or improving task performance.
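The token-count consequence in the list above can be checked with simple arithmetic. The sketch below counts the tokens needed for an utterance at 12.5 tokens per second (the rate stated in the abstract) and at a comparison rate of 50 tokens per second. The comparison rate, the 10-second utterance, and the function name `tokens_for_duration` are all illustrative assumptions, not values from the paper.

```python
import math


def tokens_for_duration(duration_s: float, tokens_per_second: float) -> int:
    """Number of discrete tokens needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * tokens_per_second)


CLEANCODEC_RATE = 12.5             # tokens per second, from the abstract
HYPOTHETICAL_BASELINE_RATE = 50.0  # assumed comparison rate, not from the paper

utterance_s = 10.0                 # assumed utterance length in seconds
clean_tokens = tokens_for_duration(utterance_s, CLEANCODEC_RATE)
baseline_tokens = tokens_for_duration(utterance_s, HYPOTHETICAL_BASELINE_RATE)

# prints: 125 500 4.0
print(clean_tokens, baseline_tokens, baseline_tokens / clean_tokens)
```

The sequence-length ratio computed here is not the same quantity as the reported "up to 17x" inference speedup, which depends on the downstream model and is not derivable from the token rate alone.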

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same bottleneck principle could be tested on music or environmental audio where noise removal is also useful.
Lower token rates might allow longer context windows in autoregressive models trained on the tokens.
If the denoising step generalizes, similar selective encoding could reduce token counts in other modalities such as video.

Load-bearing premise

A learned denoising model can reliably keep only perceptually important speech content without losing linguistically or acoustically meaningful information.

What would settle it

An evaluation on noisy speech where CleanCodec produces lower intelligibility or speaker similarity scores than a non-denoising codec at the same token rate.
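A test of this kind needs noisy inputs at a controlled noise level. The sketch below shows one common way to build them: scale a noise signal so that the mixture has a chosen signal-to-noise ratio, then compare scores from the two codecs. The function names, the synthetic sine-wave "speech", the 5 dB setting, and the assumption that higher scores are better are all illustrative; the abstract does not describe the paper's evaluation protocol.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_speech / P_noise), so P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of `mixture` given the clean `speech` it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(speech_power / residual_power)


def claim_is_contradicted(cleancodec_score, baseline_score):
    """True if the denoising codec scores lower than the baseline (higher = better)."""
    return cleancodec_score < baseline_score


# Synthetic stand-ins for real recordings (assumed 16 kHz, one second long).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(round(measured_snr_db(speech, noisy), 3))
```

In an actual study, `noisy` would be passed through each codec at the same token rate and the resulting intelligibility and speaker-similarity scores fed to `claim_is_contradicted`.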

Figures

Figures reproduced from arXiv: 2606.04418 by Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin.

**Figure 2.** Architecture of CleanCodec, along with loss

Original abstract

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CleanCodec reframes tokenization as a selective bottleneck via denoising to hit 12.5 tokens/sec with claimed quality and speed gains, but the abstract gives no equations or architecture to verify the separation actually works.

The main takeaway is that this paper proposes CleanCodec as a denoising codec that learns to drop background noise and artifacts while keeping perceptually relevant content for speech tokens. It reports 12.5 tokens per second with better speaker similarity and intelligibility than prior codecs, plus gains on TTS and voice conversion and up to 17x faster inference.

What is new is the explicit reframing of the problem as a selective information bottleneck enforced through a denoising objective. That is a reasonable angle on top of existing neural codec work, and the downstream efficiency numbers are the kind of practical result that matters for modeling pipelines.

The paper does well at identifying the waste in current codecs and at showing concrete speedups in real tasks. The efficiency claim and the inference improvement are the strongest parts based on the summary.

The soft spot is that the abstract provides no loss terms, architecture details, or ablation results to show how the selective discarding is actually enforced rather than assumed. The central assumption, that the model reliably keeps linguistic and speaker content while dropping only imperceptible noise, needs the methods section and controls to hold up. Without those, the separation could be fragile.

This is for people building speech tokenizers or downstream models who need lower token rates. It deserves a serious referee because the efficiency problem is real and the proposed direction is plausible, even if the current description is high-level and the full experiments will decide the strength.

Referee Report

2 major / 0 minor

Summary. The paper introduces CleanCodec, a denoising neural audio codec that reframes audio tokenization as a selective information bottleneck problem. It learns to encode only perceptually important features while discarding imperceptible information such as background noise and artifacts. The central claim is that at 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs on speaker similarity and speech intelligibility, with further gains on downstream text-to-speech and voice conversion tasks including up to 17x faster inference.

Significance. If the empirical claims hold under detailed scrutiny, the work could meaningfully advance efficient and robust speech tokenization for downstream modeling pipelines. The perceptual guidance via denoising offers a principled way to reduce token rates without sacrificing linguistic or speaker content, which would benefit scaling of audio generative models and real-time applications. Strengths include the focus on a concrete efficiency metric (12.5 tokens/sec) and explicit downstream task evaluations.

major comments (2)

The provided manuscript text consists only of the abstract and contains no architecture diagrams, loss formulations, training objectives, or equations describing how the selective information bottleneck is enforced (e.g., no details on the denoising objective or perceptual loss terms). This prevents verification of whether the method reliably separates perceptually important features from noise as claimed.
[Abstract] The abstract asserts 'substantially outperforming existing codecs' and 'state-of-the-art tokenization efficiency' at 12.5 tokens per second, but supplies no quantitative metrics, baseline comparisons, or table references to support the magnitude of gains in speaker similarity or intelligibility. Without these, the central efficiency-quality tradeoff claim cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the opportunity to clarify aspects of the manuscript. We respond to each major comment below.

Referee: The provided manuscript text consists only of the abstract and contains no architecture diagrams, loss formulations, training objectives, or equations describing how the selective information bottleneck is enforced (e.g., no details on the denoising objective or perceptual loss terms). This prevents verification of whether the method reliably separates perceptually important features from noise as claimed.

Authors: The version provided for review was limited to the abstract. The complete manuscript contains architecture diagrams, loss formulations, training objectives, and equations detailing the selective information bottleneck, denoising objective, and perceptual loss terms. We will submit the full manuscript with these technical details in the revision.
Revision: yes

Referee: [Abstract] The abstract asserts 'substantially outperforming existing codecs' and 'state-of-the-art tokenization efficiency' at 12.5 tokens per second, but supplies no quantitative metrics, baseline comparisons, or table references to support the magnitude of gains in speaker similarity or intelligibility. Without these, the central efficiency-quality tradeoff claim cannot be evaluated.

Authors: Abstracts are concise summaries and conventionally omit specific quantitative metrics and table references. The full manuscript includes tables with baseline comparisons and metrics on speaker similarity and intelligibility at 12.5 tokens per second. We can add a brief reference to the relevant evaluation table within the abstract if the referee recommends it.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and context describe CleanCodec as a reframing of tokenization into a selective information bottleneck via a denoising objective, with performance claims resting on the proposed architecture and external downstream evaluations. No equations, fitted parameters presented as predictions, self-citations, or uniqueness theorems appear in the text. The derivation chain is therefore self-contained against external benchmarks with no load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; main unstated premise is learnability of perceptual importance without explicit supervision or labels.

axioms (1)

Domain assumption: Perceptually important features can be distinguished from imperceptible information such as noise and artifacts via learned encoding.
Core premise enabling the selective information bottleneck and denoising behavior.


[13] [13]

wav2vec 2.0:

Alexei Baevski and Yuhao Zhou and Abdelrahman Mohamed and Michael Auli , editor =. wav2vec 2.0:. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual , year =

2020

[14] [14]

Neural Discrete Representation Learning , booktitle =

A. Neural Discrete Representation Learning , booktitle =. 2017 , url =

2017

[15] [15]

The Twelfth International Conference on Learning Representations,

Hubert Siuzdak , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024

[16] [16]

The Thirteenth International Conference on Learning Representations,

Shengpeng Ji and Ziyue Jiang and Wen Wang and Yifu Chen and Minghui Fang and Jialong Zuo and Qian Yang and Xize Cheng and Zehan Wang and Ruiqi Li and Ziang Zhang and Xiaoda Yang and Rongjie Huang and Yidi Jiang and Qian Chen and Siqi Zheng and Zhou Zhao , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[17] [17]

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis , booktitle =

Jungil Kong and Jaehyeon Kim and Jaekyoung Bae , editor =. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis , booktitle =. 2020 , url =

2020

[18] [18]

High Fidelity Neural Audio Compression , journal =

Alexandre D. High Fidelity Neural Audio Compression , journal =. 2023 , url =

2023

[19] [19]

High-Fidelity Audio Compression with Improved

Rithesh Kumar and Prem Seetharaman and Alejandro Luebs and Ishaan Kumar and Kundan Kumar , editor =. High-Fidelity Audio Compression with Improved. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 , year =

2023

[20] [20]

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation , booktitle =

Hanzhao Li and Liumeng Xue and Haohan Guo and Xinfa Zhu and Yuanjun Lv and Lei Xie and Yunlin Chen and Hao Yin and Zhifei Li , editor =. Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation , booktitle =. 2024 , url =. doi:10.21437/INTERSPEECH.2024-1559 , timestamp =

work page doi:10.21437/interspeech.2024-1559 2024

[21] [21]

CoRR , volume =

Luca Della Libera and Francesco Paissan and Cem Subakan and Mirco Ravanelli , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2502.04465 , eprinttype =. 2502.04465 , timestamp =

work page doi:10.48550/arxiv.2502.04465 2025

[22] [22]

URLhttps://arxiv.org/abs/2508.19205

Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2508.19205 , eprinttype =. 2508.19205 , timestamp =

work page doi:10.48550/arxiv.2508.19205 2025

[23] [23]

The Twelfth International Conference on Learning Representations,

Fabian Mentzer and David Minnen and Eirikur Agustsson and Michael Tschannen , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024

[24] [24]

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis , journal =

Zhen Ye and Xinfa Zhu and Chi. Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis , journal =. 2025 , url =. doi:10.48550/ARXIV.2502.04128 , eprinttype =. 2502.04128 , timestamp =

work page doi:10.48550/arxiv.2502.04128 2025

[25] [25]

UnivNet:

Won Jang and Dan Lim and Jaesam Yoon and Bongwan Kim and Juntae Kim , editor =. UnivNet:. 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021 , pages =. 2021 , url =. doi:10.21437/INTERSPEECH.2021-1016 , timestamp =

work page doi:10.21437/interspeech.2021-1016 2021

[26] [26]

BigVGAN:

Sang. BigVGAN:. The Eleventh International Conference on Learning Representations,. 2023 , url =

2023

[27] [27]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre D. Moshi: a speech-text foundation model for real-time dialogue , journal =. 2024 , url =. doi:10.48550/ARXIV.2410.00037 , eprinttype =. 2410.00037 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.00037 2024

[28] [28]

Qwen3-TTS Technical Report

Qwen Team , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2601.15621 , eprinttype =. 2601.15621 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.15621 2026

[29] [29]

Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg , title =

Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis M. Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2509.14128 , eprinttype =. 2509.14128 , timestamp =

work page doi:10.48550/arxiv.2509.14128 2025

[30] [30]

23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022 , pages =

Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari , editor =. 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022 , pages =. 2022 , url =. doi:10.21437/INTERSPEECH.2022-439 , timestamp =

work page doi:10.21437/interspeech.2022-439 2022

[31] [31]

Asymptotic

Anurag Kumar and Ke Tan and Zhaoheng Ni and Pranay Manocha and Xiaohui Zhang and Ethan Henderson and Buye Xu , title =. 2023 , url =. doi:10.1109/ICASSP49357.2023.10096680 , timestamp =

work page doi:10.1109/icassp49357.2023.10096680 2023

[32] [32]

Reshape Dimensions Network for Speaker Recognition , booktitle =

Ivan Yakovlev and Rostislav Makarov and Andrei Balykin and Pavel Malov and Anton Okhotnikov and Nikita Torgashov , editor =. Reshape Dimensions Network for Speaker Recognition , booktitle =. 2024 , url =. doi:10.21437/INTERSPEECH.2024-2116 , timestamp =

work page doi:10.21437/interspeech.2024-2116 2024

[33] [33]

C Users J

Gage, Philip , title =. C Users J. , month = feb, pages =. 1994 , issue_date =

1994

[34] [34]

Gray , title=

Robert M. Gray , title=. IEEE Assp Magazine , volume=. 1984 , publisher=

1984

[35] [35]

Sensors , volume =

Yongjian Chen and Tao Guan and Cheng Wang , title =. Sensors , volume =. 2010 , url =. doi:10.3390/S101211259 , timestamp =

work page doi:10.3390/s101211259 2010

[36] [36]

Neural computation , volume=

Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

1997

[37] [37]

LibriTTS-R:

Yuma Koizumi and Heiga Zen and Shigeki Karita and Yifan Ding and Kohei Yatabe and Nobuyuki Morioka and Michiel Bacchiani and Yu Zhang and Wei Han and Ankur Bapna , editor =. LibriTTS-R:. 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023 , pages =. 2023 , url =. doi:10.21437/...

work page doi:10.21437/interspeech.2023-1584 2023

[38] [38]

CoRR , volume =

Haorui He and Zengqiang Shang and Chaoren Wang and Xuyuan Li and Yicheng Gu and Hua Hua and Liwei Liu and Chen Yang and Jiaqi Li and Peiyang Shi and Yuancheng Wang and Kai Chen and Pengyuan Zhang and Zhizheng Wu , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2501.15907 , eprinttype =. 2501.15907 , timestamp =

work page doi:10.48550/arxiv.2501.15907 2025

[39] [39]

GigaSpeech: An Evolving, Multi-Domain

Guoguo Chen and Shuzhou Chai and Guan. GigaSpeech: An Evolving, Multi-Domain. 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021 , pages =. 2021 , url =. doi:10.21437/INTERSPEECH.2021-1965 , timestamp =

work page doi:10.21437/interspeech.2021-1965 2021

[40] [40]

Expresso:

Tu Anh Nguyen and Wei. Expresso:. 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023 , pages =. 2023 , url =. doi:10.21437/INTERSPEECH.2023-1905 , timestamp =

work page doi:10.21437/interspeech.2023-1905 2023

[41] [41]

2022 , url =

Alexis Conneau and Min Ma and Simran Khanuja and Yu Zhang and Vera Axelrod and Siddharth Dalmia and Jason Riesa and Clara Rivera and Ankur Bapna , title =. 2022 , url =. doi:10.1109/SLT54892.2023.10023141 , timestamp =

work page doi:10.1109/slt54892.2023.10023141 2022

[42] [42]

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models , booktitle =

Zeqian Ju and Yuancheng Wang and Kai Shen and Xu Tan and Detai Xin and Dongchao Yang and Eric Liu and Yichong Leng and Kaitao Song and Siliang Tang and Zhizheng Wu and Tao Qin and Xiangyang Li and Wei Ye and Shikun Zhang and Jiang Bian and Lei He and Jinyu Li and Sheng Zhao , editor =. NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and ...

2024

[43] [43]

M., Qian, X., and Yoon, B.-J

Yong Ren and Tao Wang and Jiangyan Yi and Le Xu and Jianhua Tao and Chu Yuan Zhang and Junzuo Zhou , title =. 2024 , url =. doi:10.1109/ICASSP48485.2024.10448454 , timestamp =

work page doi:10.1109/icassp48485.2024.10448454 2024

[44] [44]

CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)

Yamagishi, Junichi and Veaux, Christophe and MacDonald, Kirsten. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)

[45] [45]

Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu

Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu , editor =. LibriTTS:. 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019 , pages =. 2019 , url =. doi:10.21437/INTERSPEECH.2019-2441 , timestamp =

work page doi:10.21437/interspeech.2019-2441 2019

[46] [46]

Simple and Controllable Music Generation , booktitle =

Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre D. Simple and Controllable Music Generation , booktitle =. 2023 , url =

2023

[47] [47]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou and Jiawei Chen and Jitong Chen and Yuanzhe Chen and Zhuo Chen and Ziyi Chen and Jian Cong and Lelai Deng and Chuang Ding and Lu Gao and Mingqing Gong and Peisong Huang and Qingqing Huang and Zhiying Huang and Yuanyuan Huo and Dongya Jia and Chumin Li and Feiya Li and Hui Li and Jiaxin Li and Xiaoyang Li and Xingxing Li and Lin Liu and ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.02430 2024

[48] [48]

CoRR , volume =

Wataru Nakata and Yuki Saito and Yota Ueda and Hiroshi Saruwatari , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2509.17052 , eprinttype =. 2509.17052 , timestamp =

work page doi:10.48550/arxiv.2509.17052 2025

[49] [49]

Robin Scheibler and Eric Bezzam and Ivan Dokmanic , title =. 2018. 2018 , url =. doi:10.1109/ICASSP.2018.8461310 , timestamp =

work page doi:10.1109/icassp.2018.8461310 2018

[50] [50]

Audio Set: An ontology and human-labeled dataset for audio events

Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter , title =. 2017. 2017 , url =. doi:10.1109/ICASSP.2017.7952261 , timestamp =

work page doi:10.1109/icassp.2017.7952261 2017

[51] [51]

FSD50K: an open dataset of human-labeled sound events,

Eduardo Fonseca and Xavier Favory and Jordi Pons and Frederic Font and Xavier Serra , title =. 2022 , url =. doi:10.1109/TASLP.2021.3133208 , timestamp =

work page doi:10.1109/taslp.2021.3133208 2022

[52] [52]

WHAM!: Extending Speech Separation to Noisy Environments , booktitle =

Gordon Wichern and Joe Antognini and Michael Flynn and Licheng Richard Zhu and Emmett McQuinn and Dwight Crow and Ethan Manilow and Jonathan Le Roux , editor =. WHAM!: Extending Speech Separation to Noisy Environments , booktitle =. 2019 , url =. doi:10.21437/INTERSPEECH.2019-2821 , timestamp =

work page doi:10.21437/interspeech.2019-2821 2019

[53] [53]

CoRR , volume =

Yao Shi and Hui Bu and Xin Xu and Shaoji Zhang and Ming Li , title =. CoRR , volume =. 2020 , url =. 2010.11567 , timestamp =

arXiv 2020

[54] [54]

doi:10.21437/interspeech.2017-950 , booktitle=

Arsha Nagrani and Joon Son Chung and Andrew Zisserman , editor =. VoxCeleb:. 18th Annual Conference of the International Speech Communication Association, Interspeech 2017, Stockholm, Sweden, August 20-24, 2017 , pages =. 2017 , url =. doi:10.21437/INTERSPEECH.2017-950 , timestamp =

work page doi:10.21437/interspeech.2017-950 2017

[55] [55]

The Twelfth International Conference on Learning Representations,

Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024

[56] [56]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang and Sanyuan Chen and Yu Wu and Ziqiang Zhang and Long Zhou and Shujie Liu and Zhuo Chen and Yanqing Liu and Huaming Wang and Jinyu Li and Lei He and Sheng Zhao and Furu Wei , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2301.02111 , eprinttype =. 2301.02111 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2301.02111 2023

[57] [57]

Parker and Anton Smirnov and Jordi Pons and CJ Carr and Zack Zukowski and Zach Evans and Xubo Liu , title =

Julian D. Parker and Anton Smirnov and Jordi Pons and CJ Carr and Zack Zukowski and Zach Evans and Xubo Liu , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[58] [58]

DualCodec:

Jiaqi Li and Xiaolong Lin and Zhekai Li and Shixi Huang and Yuancheng Wang and Chaoren Wang and Zhenpeng Zhan and Zhizheng Wu , editor =. DualCodec:. 26th Annual Conference of the International Speech Communication Association, Interspeech 2025, Rotterdam, The Netherlands, 17-21 August 2025 , publisher =. 2025 , url =. doi:10.21437/INTERSPEECH.2025-468 , ...

work page doi:10.21437/interspeech.2025-468 2025

[59] [59]

CoRR , volume =

Yu Pan and Lei Ma and Jianjun Zhao , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2404.02702 , eprinttype =. 2404.02702 , timestamp =

work page doi:10.48550/arxiv.2404.02702 2024