Recognition: unknown
Scaling Properties of Continuous Diffusion Spoken Language Models
Pith reviewed 2026-05-08 03:39 UTC · model grok-4.3
The pith
Continuous diffusion spoken language models follow the same scaling laws as autoregressive models, reaching emotive, multi-speaker output at 16 billion parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continuous diffusion spoken language models exhibit scaling laws for validation loss and phoneme Jensen-Shannon divergence that mirror autoregressive behavior. The optimal token-to-parameter ratio decreases with increasing compute, yet loss itself becomes insensitive to the precise allocation of data versus parameters at large scale. Scaling to 16 billion parameters on tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, although long-form coherence remains a significant challenge.
What carries the argument
The phoneme Jensen-Shannon divergence (pJSD) metric, which computes divergence between phoneme probability distributions of generated and reference speech to assess linguistic quality.
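To make the metric concrete, here is a minimal sketch of a pJSD-style computation, assuming unigram phoneme distributions over a fixed vocabulary and phoneme sequences already extracted by some recognizer; the function names and the unigram granularity are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a pJSD-style metric: Jensen-Shannon divergence between
# unigram phoneme distributions of generated and reference speech.
# ASSUMPTIONS: phoneme sequences come from some external recognizer, and the
# distributions are unigram over a fixed vocabulary; the paper's exact
# recognizer and granularity are not specified here.
from collections import Counter
import math

def phoneme_distribution(phonemes, vocab):
    """Normalized unigram distribution over a fixed phoneme vocabulary."""
    counts = Counter(phonemes)
    total = sum(counts.values()) or 1
    return [counts.get(p, 0) / total for p in vocab]

def kl(p, q):
    """KL divergence in bits, skipping zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pjsd(generated, reference, vocab):
    """Jensen-Shannon divergence between phoneme distributions; 0 = identical, 1 = disjoint."""
    p = phoneme_distribution(generated, vocab)
    q = phoneme_distribution(reference, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

vocab = ["AA", "AE", "B", "D", "IY", "K", "S", "T"]
generated = ["B", "AA", "T", "S", "IY"]   # toy decoded phonemes
reference = ["B", "AE", "T", "K", "IY"]   # toy reference phonemes
print(pjsd(generated, reference, vocab))  # 0.4 for this toy pair
```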
If this is right
- Validation loss and pJSD both improve according to power laws as model size and data increase (a fitting sketch follows this list).
- The optimal ratio of tokens to parameters declines as total compute grows.
- At large scales the loss becomes largely independent of the exact data-to-parameter split, so a smaller model trained on more tokens can match a larger one at the same compute, opening routes to faster inference.
- Models at 16 billion parameters trained on tens of millions of hours can synthesize speech carrying emotion, prosody, speaker identity, and language variety.
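To ground the first bullet, here is a sketch of a standard saturating power-law fit of validation loss against model size, L(N) = E + A·N^(-alpha); the data points and fitted coefficients are invented for demonstration and should not be read as the paper's.

```python
# Illustrative power-law fit: validation loss vs. model size in the
# saturating form L(N) = E + A * N**(-alpha). Data points and coefficients
# are INVENTED for demonstration, not taken from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, e, a, alpha):
    """Saturating power law; n is model size in units of 1e8 parameters."""
    return e + a * n ** (-alpha)

sizes = np.array([1e8, 3e8, 1e9, 3e9, 1.6e10]) / 1e8  # rescaled for conditioning
losses = np.array([2.10, 1.85, 1.62, 1.45, 1.31])      # hypothetical observations

(e, a, alpha), _ = curve_fit(power_law, sizes, losses, p0=[1.0, 1.0, 0.3])
print(f"irreducible loss E ≈ {e:.2f}, scaling exponent alpha ≈ {alpha:.2f}")
```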
Where Pith is reading between the lines
- Continuous diffusion may bypass the information bottleneck created by discretizing raw audio waveforms.
- The observed loss insensitivity could allow practitioners to trade model size for data volume without retraining from scratch.
- Long-form coherence failures point to a need for explicit mechanisms that maintain context across many seconds of speech.
- These scaling behaviors suggest hybrid continuous-diffusion and text-model pipelines could become practical for real-time spoken dialogue systems.
Load-bearing premise
The newly introduced phoneme Jensen-Shannon divergence metric reliably measures the linguistic quality and coherence of generated speech.
What would settle it
Training a 16-billion-parameter model that achieves low pJSD yet produces speech that human raters consistently judge as linguistically incoherent over long sequences would falsify the central claim.
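One way to run that check in miniature: rank-correlate pJSD against human ratings and look at the sign and strength of the relationship. The sketch below assumes per-utterance pJSD scores and MOS-style ratings; both arrays are hypothetical, since the paper reports no such correlation.

```python
# Sketch of the validation this test presupposes: that pJSD and human
# judgments move together. Both arrays are HYPOTHETICAL.
from scipy.stats import spearmanr

pjsd_scores = [0.12, 0.31, 0.08, 0.45, 0.22, 0.19]  # lower = closer to reference
mos_ratings = [4.3, 3.1, 4.6, 2.4, 3.8, 4.0]        # higher = judged better

rho, pval = spearmanr(pjsd_scores, mos_ratings)
print(f"Spearman rho = {rho:.2f} (p = {pval:.4f})")
# Strongly negative rho supports pJSD as a quality proxy; rho near zero
# (or positive) at low pJSD would be the failure mode described above.
```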
Original abstract
Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether continuous diffusion (CD) SLM is more viable. To quantify the SLMs linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores continuous diffusion spoken language models (CD SLMs) as an alternative to discrete autoregressive SLMs, which suffer from discretization bottlenecks. It introduces the phoneme Jensen-Shannon divergence (pJSD) metric to measure linguistic quality and reports that CD SLMs exhibit scaling laws in validation loss and pJSD similar to AR models. Optimal token-to-parameter ratios decrease with scale, but loss becomes insensitive to data/model size choices, suggesting fast inference potential. Scaling to 16B parameters on tens of millions of hours of conversational data enables emotive, prosodic, multi-speaker, multilingual speech generation, though long-form coherence remains unsolved.
Significance. If the scaling observations and pJSD-based claims hold, the work suggests CD SLMs could match or exceed AR SLMs in capability while offering computational advantages for inference. The empirical scaling laws and new metric provide a foundation for future speech-only model development, particularly in data-efficient regimes. Credit is due for the large-scale empirical analysis, for the comparison against prior AR results, and for highlighting both capabilities and remaining challenges such as long-form coherence.
major comments (2)
- [Abstract] The central claims that scaling CD SLMs to 16B parameters enables emotive/prosodic/multi-speaker/multilingual speech and that pJSD tracks linguistic quality/coherence rest on pJSD decreasing with model scale. However, pJSD is defined solely as Jensen-Shannon divergence over phoneme distributions from generated vs. reference speech, with no reported correlation to human judgments, WER, prosody metrics, or long-form coherence measures. This makes it unclear whether observed scaling laws support the headline generation capabilities or primarily reflect local phoneme statistics.
- [Abstract] The statement that 'loss becomes insensitive to choice of data and model sizes, showing potential for fast inference' is presented as a key scaling property, but without details on the experimental setup (e.g., specific model sizes, data regimes, or how insensitivity was quantified), it is difficult to assess whether this holds beyond the tested range or generalizes to the 16B scale.
minor comments (2)
- [Abstract] The abstract mentions 'tens of millions of hours of conversational data' but provides no breakdown of data sources, languages covered, or preprocessing steps, which would aid reproducibility.
- [Abstract] No error bars, ablation details, or data splits are referenced in the high-level claims, making it hard to evaluate robustness of the scaling observations.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the paper's significance. We address each major comment below with proposed clarifications to the abstract.
Point-by-point responses
- Referee: [Abstract] The central claims that scaling CD SLMs to 16B parameters enables emotive/prosodic/multi-speaker/multilingual speech and that pJSD tracks linguistic quality/coherence rest on pJSD decreasing with model scale. However, pJSD is defined solely as Jensen-Shannon divergence over phoneme distributions from generated vs. reference speech, with no reported correlation to human judgments, WER, prosody metrics, or long-form coherence measures. This makes it unclear whether observed scaling laws support the headline generation capabilities or primarily reflect local phoneme statistics.
Authors: We appreciate this clarification request. pJSD is introduced as a phoneme-level distributional metric to quantify linguistic fidelity, chosen because it directly captures divergence in phoneme usage between generated and reference speech and exhibits scaling behavior consistent with validation loss and prior AR SLM results. The 16B-scale generation claims for emotive, prosodic, multi-speaker, and multilingual output are grounded in the qualitative examples and demonstrations provided in the paper, while pJSD supplies supporting quantitative evidence of improved phoneme statistics. We acknowledge that direct correlations to human judgments, WER, or long-form coherence are not reported. We will revise the abstract to more clearly separate the pJSD scaling results (tied to local phoneme quality) from the qualitative generation capabilities at scale. revision: partial
- Referee: [Abstract] The statement that 'loss becomes insensitive to choice of data and model sizes, showing potential for fast inference' is presented as a key scaling property, but without details on the experimental setup (e.g., specific model sizes, data regimes, or how insensitivity was quantified), it is difficult to assess whether this holds beyond the tested range or generalizes to the 16B scale.
Authors: We agree that the abstract would benefit from additional context on this point. The insensitivity observation derives from our compute-optimal scaling experiments, in which validation loss was measured across varying token-to-parameter ratios at fixed FLOPs budgets for model sizes from ~100M to multi-billion parameters and corresponding data volumes. Beyond the optimal ratio, further increases in data or model size yielded negligible loss improvements. We will revise the abstract to briefly reference these experimental conditions and direct readers to the relevant scaling analysis section for the exact model sizes, data regimes, and quantification approach. revision: yes
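For concreteness, here is a rough sketch of the kind of sweep the authors describe, under the common approximation C ≈ 6·N·D for training compute; the loss-surface coefficients are Chinchilla-style placeholders rather than the paper's fitted values.

```python
# Rough sketch of the described sweep: fix a FLOPs budget via C ≈ 6*N*D,
# vary the token-to-parameter split, and check how flat the loss is near
# the optimum. Loss-surface coefficients are PLACEHOLDERS, not fitted values.
import numpy as np

def loss(n, d, e=1.2, a=400.0, alpha=0.34, b=410.0, beta=0.28):
    """Hypothetical loss surface L(N, D) = E + A/N^alpha + B/D^beta."""
    return e + a / n**alpha + b / d**beta

c = 1e22                        # fixed training compute, in FLOPs
ns = np.logspace(8.5, 10.5, 9)  # candidate model sizes (parameters)
ds = c / (6 * ns)               # training tokens implied by the budget

for n, d in zip(ns, ds):
    print(f"N = {n:.1e}  D/N = {d / n:9.1f}  loss = {loss(n, d):.3f}")
# A nearly flat loss column across a wide range of D/N is what
# 'insensitivity to the data/model split' means operationally.
```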
Circularity Check
No significant circularity; scaling observations are empirical
Full rationale
The paper introduces the pJSD metric and reports empirical scaling trends for validation loss and pJSD as model size and data increase, along with token-to-parameter ratio observations. These are presented as direct measurements on trained models rather than any derivation, prediction, or fitted quantity that reduces to its own inputs by construction. No equations, self-definitional steps, or load-bearing self-citations are used to justify the core claims; the analysis explicitly notes remaining challenges like long-form coherence and compares trends to prior AR work without circular reduction.