Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 00:44 UTC · model grok-4.3
The pith
Speech LLMs can close their modality gap by feeding prosody embeddings alongside text tokens from the input side.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TextPro-SLM is built by combining WhisperPro, a unified speech encoder that outputs synchronized text tokens and prosody embeddings, with an LLM backbone trained to retain the semantic capabilities of its original text checkpoint while acquiring paralinguistic understanding. This input-side design yields the lowest modality gap observed among leading SLMs at both 3B and 7B scales and competitive performance on paralinguistic tasks, using only roughly 1,000 hours of audio for the LLM training stage.
What carries the argument
WhisperPro, the unified speech encoder that produces synchronized text tokens and prosody embeddings from audio so that spoken input resembles prosody-aware text for the LLM backbone.
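The review does not say how the two streams are combined inside the backbone, so the following is only a minimal sketch under the simplest assumption: each synchronized text token's embedding is summed with a projection of its time-aligned prosody vector, so the LLM still receives an ordinary text-shaped input sequence. Class, method, and dimension names are invented for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ProsodyAwareInput(nn.Module):
    """Hypothetical input-side fusion: each synchronized text token embedding is
    summed with a projection of its time-aligned prosody embedding, so the LLM
    backbone still consumes a sequence shaped exactly like text input."""

    def __init__(self, vocab_size: int, d_model: int, d_prosody: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)  # reused from the text LLM
        self.prosody_proj = nn.Linear(d_prosody, d_model)     # small adapter learned on speech data

    def forward(self, text_tokens: torch.Tensor, prosody: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, seq) token ids emitted by the speech encoder
        # prosody:     (batch, seq, d_prosody) embeddings aligned with those tokens
        return self.token_embed(text_tokens) + self.prosody_proj(prosody)

# Toy shapes only; the real dimensions are not given in the text above.
fusion = ProsodyAwareInput(vocab_size=32000, d_model=2048, d_prosody=256)
embeds = fusion(torch.randint(0, 32000, (1, 12)), torch.randn(1, 12, 256))
print(embeds.shape)  # torch.Size([1, 12, 2048]), passed to the backbone as input embeddings
```

A sum keeps the sequence length and embedding size identical to the text-only case, which is one way spoken input could "resemble prosody-aware text" without changing the backbone's interface.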
If this is right
- TextPro-SLM records the lowest modality gap among leading SLMs at both 3B and 7B scales.
- The model maintains strong performance on tasks that require paralinguistic understanding.
- The improvements are obtained after training the LLM component on only roughly 1,000 hours of audio.
- Addressing the input side reduces the gap more effectively than prior methods that adjusted only the output side.
Where Pith is reading between the lines
- Any new text LLM could be adapted into a speech model with comparatively small amounts of additional audio data.
- The explicit separation of prosody signals might support new tasks such as style-controlled generation or emotion-aware responses.
- The same input representation could improve robustness to accents or speaking styles that differ from the training distribution.
- Combining the input-side approach with existing output-side techniques could narrow the remaining gap still further.
Load-bearing premise
The prosody embeddings from WhisperPro can be added to the LLM input without degrading its semantic performance, and this addition is what produces the measured reduction in the modality gap.
What would settle it
An ablation in which TextPro-SLM without the prosody embeddings shows no increase in modality gap and no drop on paralinguistic tasks would indicate that the input-side prosody mechanism is not responsible for the reported gains.
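A minimal harness for that ablation, assuming a paired evaluation set with text and speech renditions of the same prompts and a model wrapper exposing a switch that drops or zeroes the prosody embeddings; every name below (answer_text, answer_speech, use_prosody) is hypothetical.

```python
def modality_gap(model, paired_eval_set, use_prosody: bool) -> float:
    """Hypothetical ablation harness: score matched text and speech versions of the
    same prompts and return the text-minus-speech accuracy difference."""
    text_correct = speech_correct = 0
    for item in paired_eval_set:  # each item carries .text, .audio, .answer (assumed)
        text_correct += model.answer_text(item.text) == item.answer
        speech_correct += model.answer_speech(item.audio, use_prosody=use_prosody) == item.answer
    n = len(paired_eval_set)
    return text_correct / n - speech_correct / n  # smaller gap means speech behaves more like text

# gap_with = modality_gap(model, eval_set, use_prosody=True)
# gap_without = modality_gap(model, eval_set, use_prosody=False)
# If gap_without stays flat (and paralinguistic scores do too), the prosody pathway is not
# what closes the gap; a clear increase without prosody supports the load-bearing premise.
```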
Original abstract
Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TextPro-SLM, an SLM constructed from TLM checkpoints that reduces the modality gap on the input side. It uses WhisperPro, a unified speech encoder producing synchronized text tokens and prosody embeddings, paired with an LLM backbone trained to retain original semantic capabilities while acquiring paralinguistic understanding. At both 3B and 7B scales, TextPro-SLM reports the lowest modality gap among compared SLMs and competitive or superior paralinguistic task performance; these results are obtained with only ~1,000 hours of LLM training audio.
Significance. If the reported reductions in modality gap and preservation of semantic performance hold under the described evaluation protocol, the work demonstrates that input-side prosody integration can be more data-efficient than output-side alignment strategies. The explicit scaling to 3B/7B models and the emphasis on ~1k-hour training budgets provide a concrete, falsifiable benchmark for future input-side modality-bridging methods in speech-language modeling.
major comments (2)
- [Experiments] Experiments section: the central claim that semantic capabilities of the original TLM are preserved requires explicit reporting of performance on at least one standard text-only LLM benchmark (e.g., MMLU or a held-out text instruction-following set) before and after the prosody-augmented training; without these numbers the 'prosody-aware text LLM' framing remains under-supported.
- [Method / Experiments] The modality-gap metric itself (presumably defined in §3 or §4) is load-bearing for the headline result; the paper should include the exact formula, the reference text embedding space, and error bars across multiple runs or seeds to allow readers to assess whether the reported 'lowest gap' is statistically distinguishable from the next-best baseline.
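The page does not reproduce the paper's gap formula. One plausible formulation of the metric the second major comment asks about, stated here purely as an assumption, is the accuracy difference between text and speech renditions of the same items, with a paired bootstrap over items to attach the error bars the referee requests.

```python
import random

def gap_with_bootstrap(text_scores, speech_scores, n_boot=1000, seed=0):
    """Assumed gap definition: mean(text) - mean(speech) over the same items,
    with a paired bootstrap over items to attach a 95% interval.
    text_scores / speech_scores: per-item 0/1 correctness on matched prompts."""
    rng = random.Random(seed)
    n = len(text_scores)
    point = sum(text_scores) / n - sum(speech_scores) / n
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        draws.append(sum(text_scores[i] for i in idx) / n -
                     sum(speech_scores[i] for i in idx) / n)
    draws.sort()
    return point, (draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot)])

gap, ci = gap_with_bootstrap([1, 1, 1, 0, 1], [1, 0, 1, 0, 1])  # toy data
print(gap, ci)  # 0.2 plus a paired-bootstrap interval
```

If the paper instead defines the gap in an embedding space, the bootstrap-over-items pattern still applies; only the per-item statistic changes.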
minor comments (2)
- [Abstract] Abstract: quantitative values for the modality gap (e.g., the actual distance or accuracy delta) and the precise paralinguistic task accuracies should be stated so the abstract can stand alone.
- [Figures / Tables] Figure captions and table headers should explicitly state the number of runs or seeds used for each reported score; this is a minor but necessary clarification for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the evidential support and methodological clarity.
Point-by-point responses
- Referee: [Experiments] Experiments section: the central claim that semantic capabilities of the original TLM are preserved requires explicit reporting of performance on at least one standard text-only LLM benchmark (e.g., MMLU or a held-out text instruction-following set) before and after the prosody-augmented training; without these numbers the 'prosody-aware text LLM' framing remains under-supported.
Authors: We agree that explicit before-and-after comparisons on standard text-only benchmarks would more robustly substantiate the claim of preserved semantic capabilities. In the revised manuscript we will report results on a held-out text instruction-following evaluation set (and MMLU if space allows) for both the original TLM checkpoint and the fine-tuned TextPro-SLM. This addition will directly address the concern and better support the 'prosody-aware text LLM' description (a minimal sketch of such a check follows these responses). Revision: yes
- Referee: [Method / Experiments] The modality-gap metric itself (presumably defined in §3 or §4) is load-bearing for the headline result; the paper should include the exact formula, the reference text embedding space, and error bars across multiple runs or seeds to allow readers to assess whether the reported 'lowest gap' is statistically distinguishable from the next-best baseline.
Authors: We thank the referee for underscoring the need for full specification of this central metric. We will add the precise mathematical formula for the modality gap, explicitly identify the reference text embedding space, and report error bars (standard deviations across multiple random seeds) in the revised Experiments section. These changes will enable readers to evaluate whether the lowest-gap result is statistically distinguishable from baselines. Revision: yes
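As promised in the first response above, the semantic-retention check reduces to scoring the original text checkpoint and the fine-tuned TextPro-SLM on the same text-only benchmark. Neither a benchmark loader nor a model interface appears on this page, so the harness below is hypothetical (answer_text, mmlu_subset, original_tlm, textpro_slm are placeholder names).

```python
def text_benchmark_accuracy(model, benchmark) -> float:
    """Hypothetical text-only evaluation: multiple-choice items answered from text alone."""
    correct = sum(model.answer_text(item.question) == item.answer for item in benchmark)
    return correct / len(benchmark)

# before = text_benchmark_accuracy(original_tlm, mmlu_subset)   # original text checkpoint
# after = text_benchmark_accuracy(textpro_slm, mmlu_subset)     # after prosody-augmented training
# A negligible drop from before to after supports the 'prosody-aware text LLM' framing;
# a large drop would undermine the preservation claim.
```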
Circularity Check
No significant circularity; empirical claims rest on experiments
full rationale
The paper advances an empirical architecture (TextPro-SLM) that fuses a WhisperPro encoder producing synchronized text tokens and prosody embeddings with an LLM backbone. All central claims—lowest modality gap at 3B/7B scales, preserved semantic capability, paralinguistic gains with ~1k hours of audio—are presented as outcomes of training and evaluation protocols rather than as derivations from equations or first-principles results. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the headline result to its own inputs appear in the provided text. The argument is therefore self-contained and falsifiable via the reported metrics and baselines.
Axiom & Free-Parameter Ledger
invented entities (2)
- WhisperPro: no independent evidence
- TextPro-SLM: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · matched text: "WhisperPro ... produces synchronized text tokens and prosody embeddings ... mel-reconstructor ... L = L_ASR + λ L_mel"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · matched text: "TextPro-SLM ... makes spoken input more closely resemble that of a prosody-aware text LLM"