Recognition: unknown
Scaling Properties of Continuous Diffusion Spoken Language Models
Pith reviewed 2026-05-08 03:39 UTC · model grok-4.3
The pith
Continuous diffusion spoken language models follow the same scaling laws as autoregressive models, reaching emotive, multi-speaker output at 16 billion parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continuous diffusion spoken language models exhibit scaling laws for validation loss and phoneme Jensen-Shannon divergence that mirror autoregressive behavior. The optimal token-to-parameter ratio decreases with increasing compute, yet loss itself becomes insensitive to the precise allocation of data versus parameters at large scale. Scaling to 16 billion parameters on tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, although long-form coherence remains a significant challenge.
What carries the argument
The phoneme Jensen-Shannon divergence (pJSD) metric, which computes divergence between phoneme probability distributions of generated and reference speech to assess linguistic quality.
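To make the metric concrete, here is a minimal sketch of a pJSD-style computation, assuming unigram phoneme distributions over a fixed vocabulary and phoneme sequences already extracted by some recognizer; the function names and the unigram granularity are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a pJSD-style metric: Jensen-Shannon divergence between
# unigram phoneme distributions of generated and reference speech.
# ASSUMPTIONS: phoneme sequences come from some external recognizer, and the
# distributions are unigram over a fixed vocabulary; the paper's exact
# recognizer and granularity are not specified here.
from collections import Counter
import math

def phoneme_distribution(phonemes, vocab):
    """Normalized unigram distribution over a fixed phoneme vocabulary."""
    counts = Counter(phonemes)
    total = sum(counts.values()) or 1
    return [counts.get(p, 0) / total for p in vocab]

def kl(p, q):
    """KL divergence in bits, skipping zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pjsd(generated, reference, vocab):
    """Jensen-Shannon divergence between phoneme distributions; 0 = identical, 1 = disjoint."""
    p = phoneme_distribution(generated, vocab)
    q = phoneme_distribution(reference, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

vocab = ["AA", "AE", "B", "D", "IY", "K", "S", "T"]
generated = ["B", "AA", "T", "S", "IY"]   # toy decoded phonemes
reference = ["B", "AE", "T", "K", "IY"]   # toy reference phonemes
print(pjsd(generated, reference, vocab))  # 0.4 for this toy pair
```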
If this is right
- Validation loss and pJSD both improve according to power laws as model size and data increase (a fitting sketch follows this list).
- The optimal ratio of tokens to parameters declines as total compute grows.
- At large scales the loss becomes largely independent of the exact data-to-parameter split, so a smaller model trained on more tokens can match a larger one at the same compute, opening routes to faster inference.
- Models at 16 billion parameters trained on tens of millions of hours can synthesize speech carrying emotion, prosody, speaker identity, and language variety.
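To ground the first bullet, here is a sketch of a standard saturating power-law fit of validation loss against model size, L(N) = E + A·N^(-alpha); the data points and fitted coefficients are invented for demonstration and should not be read as the paper's.

```python
# Illustrative power-law fit: validation loss vs. model size in the
# saturating form L(N) = E + A * N**(-alpha). Data points and coefficients
# are INVENTED for demonstration, not taken from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, e, a, alpha):
    """Saturating power law; n is model size in units of 1e8 parameters."""
    return e + a * n ** (-alpha)

sizes = np.array([1e8, 3e8, 1e9, 3e9, 1.6e10]) / 1e8  # rescaled for conditioning
losses = np.array([2.10, 1.85, 1.62, 1.45, 1.31])      # hypothetical observations

(e, a, alpha), _ = curve_fit(power_law, sizes, losses, p0=[1.0, 1.0, 0.3])
print(f"irreducible loss E ≈ {e:.2f}, scaling exponent alpha ≈ {alpha:.2f}")
```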
Where Pith is reading between the lines
- Continuous diffusion may bypass the information bottleneck created by discretizing raw audio waveforms.
- The observed loss insensitivity could allow practitioners to trade model size for data volume without retraining from scratch.
- Long-form coherence failures point to a need for explicit mechanisms that maintain context across many seconds of speech.
- These scaling behaviors suggest hybrid continuous-diffusion and text-model pipelines could become practical for real-time spoken dialogue systems.
Load-bearing premise
The newly introduced phoneme Jensen-Shannon divergence metric reliably measures the linguistic quality and coherence of generated speech.
What would settle it
Training a 16-billion-parameter model that achieves low pJSD yet produces speech that human raters consistently judge as linguistically incoherent over long sequences would falsify the central claim.
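One way to run that check in miniature: rank-correlate pJSD against human ratings and look at the sign and strength of the relationship. The sketch below assumes per-utterance pJSD scores and MOS-style ratings; both arrays are hypothetical, since the paper reports no such correlation.

```python
# Sketch of the validation this test presupposes: that pJSD and human
# judgments move together. Both arrays are HYPOTHETICAL.
from scipy.stats import spearmanr

pjsd_scores = [0.12, 0.31, 0.08, 0.45, 0.22, 0.19]  # lower = closer to reference
mos_ratings = [4.3, 3.1, 4.6, 2.4, 3.8, 4.0]        # higher = judged better

rho, pval = spearmanr(pjsd_scores, mos_ratings)
print(f"Spearman rho = {rho:.2f} (p = {pval:.4f})")
# Strongly negative rho supports pJSD as a quality proxy; rho near zero
# (or positive) at low pJSD would be the failure mode described above.
```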
Original abstract
Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether continuous diffusion (CD) SLM is more viable. To quantify the SLMs linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores continuous diffusion spoken language models (CD SLMs) as an alternative to discrete autoregressive SLMs, which suffer from discretization bottlenecks. It introduces the phoneme Jensen-Shannon divergence (pJSD) metric to measure linguistic quality and reports that CD SLMs exhibit scaling laws in validation loss and pJSD similar to AR models. Optimal token-to-parameter ratios decrease with scale, but loss becomes insensitive to data/model size choices, suggesting fast inference potential. Scaling to 16B parameters on tens of millions of hours of conversational data enables emotive, prosodic, multi-speaker, multilingual speech generation, though long-form coherence remains unsolved.
Significance. If the scaling observations and pJSD-based claims hold, the work suggests CD SLMs could match or exceed AR SLMs in capability while offering computational advantages for inference. The empirical scaling laws and new metric provide a foundation for future speech-only model development, particularly in data-efficient regimes. Credit is due for the large-scale empirical analysis, for the comparison against prior AR results, and for highlighting both capabilities and remaining challenges such as long-form coherence.
major comments (2)
- [Abstract] The central claims that scaling CD SLMs to 16B parameters enables emotive/prosodic/multi-speaker/multilingual speech and that pJSD tracks linguistic quality/coherence rest on pJSD decreasing with model scale. However, pJSD is defined solely as Jensen-Shannon divergence over phoneme distributions from generated vs. reference speech, with no reported correlation to human judgments, WER, prosody metrics, or long-form coherence measures. This makes it unclear whether observed scaling laws support the headline generation capabilities or primarily reflect local phoneme statistics.
- [Abstract] The statement that 'loss becomes insensitive to choice of data and model sizes, showing potential for fast inference' is presented as a key scaling property, but without details on the experimental setup (e.g., specific model sizes, data regimes, or how insensitivity was quantified), it is difficult to assess whether this holds beyond the tested range or generalizes to the 16B scale.
minor comments (2)
- [Abstract] The abstract mentions 'tens of millions of hours of conversational data' but provides no breakdown of data sources, languages covered, or preprocessing steps, which would aid reproducibility.
- [Abstract] No error bars, ablation details, or data splits are referenced in the high-level claims, making it hard to evaluate robustness of the scaling observations.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the paper's significance. We address each major comment below with proposed clarifications to the abstract.
Point-by-point responses
- Referee: [Abstract] The central claims that scaling CD SLMs to 16B parameters enables emotive/prosodic/multi-speaker/multilingual speech and that pJSD tracks linguistic quality/coherence rest on pJSD decreasing with model scale. However, pJSD is defined solely as Jensen-Shannon divergence over phoneme distributions from generated vs. reference speech, with no reported correlation to human judgments, WER, prosody metrics, or long-form coherence measures. This makes it unclear whether observed scaling laws support the headline generation capabilities or primarily reflect local phoneme statistics.
Authors: We appreciate this clarification request. pJSD is introduced as a phoneme-level distributional metric to quantify linguistic fidelity, chosen because it directly captures divergence in phoneme usage between generated and reference speech and exhibits scaling behavior consistent with validation loss and prior AR SLM results. The 16B-scale generation claims for emotive, prosodic, multi-speaker, and multilingual output are grounded in the qualitative examples and demonstrations provided in the paper, while pJSD supplies supporting quantitative evidence of improved phoneme statistics. We acknowledge that direct correlations to human judgments, WER, or long-form coherence are not reported. We will revise the abstract to more clearly separate the pJSD scaling results (tied to local phoneme quality) from the qualitative generation capabilities at scale. revision: partial
- Referee: [Abstract] The statement that 'loss becomes insensitive to choice of data and model sizes, showing potential for fast inference' is presented as a key scaling property, but without details on the experimental setup (e.g., specific model sizes, data regimes, or how insensitivity was quantified), it is difficult to assess whether this holds beyond the tested range or generalizes to the 16B scale.
Authors: We agree that the abstract would benefit from additional context on this point. The insensitivity observation derives from our compute-optimal scaling experiments, in which validation loss was measured across varying token-to-parameter ratios at fixed FLOPs budgets for model sizes from ~100M to multi-billion parameters and corresponding data volumes. Beyond the optimal ratio, further increases in data or model size yielded negligible loss improvements. We will revise the abstract to briefly reference these experimental conditions and direct readers to the relevant scaling analysis section for the exact model sizes, data regimes, and quantification approach. revision: yes
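For concreteness, here is a rough sketch of the kind of sweep the authors describe, under the common approximation C ≈ 6·N·D for training compute; the loss-surface coefficients are Chinchilla-style placeholders rather than the paper's fitted values.

```python
# Rough sketch of the described sweep: fix a FLOPs budget via C ≈ 6*N*D,
# vary the token-to-parameter split, and check how flat the loss is near
# the optimum. Loss-surface coefficients are PLACEHOLDERS, not fitted values.
import numpy as np

def loss(n, d, e=1.2, a=400.0, alpha=0.34, b=410.0, beta=0.28):
    """Hypothetical loss surface L(N, D) = E + A/N^alpha + B/D^beta."""
    return e + a / n**alpha + b / d**beta

c = 1e22                        # fixed training compute, in FLOPs
ns = np.logspace(8.5, 10.5, 9)  # candidate model sizes (parameters)
ds = c / (6 * ns)               # training tokens implied by the budget

for n, d in zip(ns, ds):
    print(f"N = {n:.1e}  D/N = {d / n:9.1f}  loss = {loss(n, d):.3f}")
# A nearly flat loss column across a wide range of D/N is what
# 'insensitivity to the data/model split' means operationally.
```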
Circularity Check
No significant circularity; scaling observations are empirical
Full rationale
The paper introduces the pJSD metric and reports empirical scaling trends for validation loss and pJSD as model size and data increase, along with token-to-parameter ratio observations. These are presented as direct measurements on trained models rather than any derivation, prediction, or fitted quantity that reduces to its own inputs by construction. No equations, self-definitional steps, or load-bearing self-citations are used to justify the core claims; the analysis explicitly notes remaining challenges like long-form coherence and compares trends to prior AR work without circular reduction.