pith. machine review for the scientific record.

arxiv: 2605.05927 · v2 · submitted 2026-05-07 · 💻 cs.CL · cs.SD · eess.AS


Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

Daxin Tan, Irwin King, Qiyong Zheng, Wenqian Cui, Xiao-Hui Li


Pith reviewed 2026-05-11 00:44 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords speech large language models · modality gap · prosody embeddings · paralinguistic understanding · input-side processing · WhisperPro · text LLM adaptation

The pith

Speech LLMs can close their modality gap by feeding prosody embeddings alongside text tokens from the input side.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the remaining gap between speech and text large language models comes from how speech reaches the model rather than from how the model generates answers. TextPro-SLM addresses this by running audio through WhisperPro to produce ordinary text tokens together with separate prosody embeddings, so the LLM backbone sees spoken input in a form closer to prosody-aware text. The backbone is trained to keep its original semantic performance while learning to use the added prosody signals for paralinguistic understanding. Experiments at 3B and 7B scales show this input-side change produces the smallest modality gap among comparable models and good results on tasks that require tone or emotion awareness, all after only about one thousand hours of audio training. If the claim holds, strong text models could be turned into capable speech models without massive new speech corpora.

Core claim

TextPro-SLM is built by combining WhisperPro, a unified speech encoder that outputs synchronized text tokens and prosody embeddings, with an LLM backbone trained to retain the semantic capabilities of its original text checkpoint while acquiring paralinguistic understanding. This input-side design yields the lowest modality gap observed among leading SLMs at both 3B and 7B scales and competitive performance on paralinguistic tasks, using only roughly 1,000 hours of audio for the LLM training stage.

What carries the argument

WhisperPro, the unified speech encoder that produces synchronized text tokens and prosody embeddings from audio so that spoken input resembles prosody-aware text for the LLM backbone.
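The load-bearing component is easier to assess with a concrete picture. A minimal sketch of the input-side design, assuming (the abstract does not specify the fusion operator) that the synchronized prosody embeddings are linearly projected into the model dimension and added position-wise to the text-token embeddings; every name and dimension here is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def prosody_aware_inputs(token_ids, tok_emb, prosody, proj):
    """Assumed additive fusion: look up text-token embeddings, project the
    time-aligned prosody embeddings into the model dimension, add them."""
    text_vecs = tok_emb[token_ids]          # (seq, d_model)
    prosody_vecs = prosody @ proj           # (seq, d_model)
    return text_vecs + prosody_vecs

d_model, d_prosody, vocab = 8, 4, 100       # toy sizes, not the paper's
tok_emb = rng.normal(size=(vocab, d_model))
proj = rng.normal(size=(d_prosody, d_model))
token_ids = np.array([5, 17, 42])           # "text tokens" from the encoder
prosody = rng.normal(size=(3, d_prosody))   # synchronized prosody embeddings

x = prosody_aware_inputs(token_ids, tok_emb, prosody, proj)
assert x.shape == (3, d_model)  # one fused vector per token, ready for the backbone
```

Additive fusion of this kind leaves the text-embedding path untouched, which is one way a backbone could retain its original semantic behavior when the prosody signal is uninformative.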

If this is right

  • TextPro-SLM records the lowest modality gap among leading SLMs at both 3B and 7B scales.
  • The model maintains strong performance on tasks that require paralinguistic understanding.
  • The improvements are obtained after training the LLM component on only roughly 1,000 hours of audio.
  • Addressing the input side reduces the gap more effectively than prior methods that adjusted only the output side.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any new text LLM could be adapted into a speech model with comparatively small amounts of additional audio data.
  • The explicit separation of prosody signals might support new tasks such as style-controlled generation or emotion-aware responses.
  • The same input representation could improve robustness to accents or speaking styles that differ from the training distribution.
  • Combining the input-side approach with existing output-side techniques could narrow the remaining gap still further.

Load-bearing premise

That the prosody embeddings from WhisperPro can be added to the LLM input without degrading its semantic performance, and that this addition is what produces the measured reduction in modality gap.

What would settle it

An ablation in which TextPro-SLM without the prosody embeddings shows no increase in modality gap and no drop on paralinguistic tasks would indicate that the input-side prosody mechanism is not responsible for the reported gains.
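The settling experiment reduces to a few lines of arithmetic. A toy sketch, assuming one plausible (not the paper's stated) definition of the modality gap as the mean per-item score drop when the same items arrive as speech rather than text; all scores below are invented:

```python
def modality_gap(text_scores, speech_scores):
    """Assumed gap definition: mean per-item drop from text to speech input."""
    return sum(t - s for t, s in zip(text_scores, speech_scores)) / len(text_scores)

# Hypothetical per-item accuracies for the two ablation arms.
text = [0.82, 0.90, 0.76, 0.88]
with_prosody = [0.80, 0.88, 0.75, 0.86]
without_prosody = [0.80, 0.89, 0.74, 0.87]

gap_with = modality_gap(text, with_prosody)        # 0.0175
gap_without = modality_gap(text, without_prosody)  # 0.0150
# If removing the prosody embeddings leaves the gap (and the paralinguistic
# scores) essentially unchanged, the input-side prosody mechanism cannot be
# credited with the reported gains; a clearly widened gap would support it.
```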

Figures

Figures reproduced from arXiv: 2605.05927 by Daxin Tan, Irwin King, Qiyong Zheng, Wenqian Cui, Xiao-Hui Li.

Figure 1. Left: Comparison of the architectural designs used by prior SLMs and our approach. Right: […]
Figure 2. Model architecture of WhisperPro and TextPro-SLM.
Figure 3. Design choice investigation results for WhisperPro.
Original abstract

Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TextPro-SLM, an SLM constructed from TLM checkpoints that reduces the modality gap on the input side. It uses WhisperPro, a unified speech encoder producing synchronized text tokens and prosody embeddings, paired with an LLM backbone trained to retain its original semantic capabilities while acquiring paralinguistic understanding. At both 3B and 7B scales, TextPro-SLM reports the lowest modality gap among compared SLMs and competitive or superior paralinguistic task performance, obtained with only ~1,000 hours of LLM training audio.

Significance. If the reported reductions in modality gap and preservation of semantic performance hold under the described evaluation protocol, the work demonstrates that input-side prosody integration can be more data-efficient than output-side alignment strategies. The explicit scaling to 3B/7B models and the emphasis on ~1k-hour training budgets provide a concrete, falsifiable benchmark for future input-side modality-bridging methods in speech-language modeling.

major comments (2)
  1. [Experiments] Experiments section: the central claim that semantic capabilities of the original TLM are preserved requires explicit reporting of performance on at least one standard text-only LLM benchmark (e.g., MMLU or a held-out text instruction-following set) before and after the prosody-augmented training; without these numbers the 'prosody-aware text LLM' framing remains under-supported.
  2. [Method / Experiments] The modality-gap metric itself (presumably defined in §3 or §4) is load-bearing for the headline result; the paper should include the exact formula, the reference text embedding space, and error bars across multiple runs or seeds to allow readers to assess whether the reported 'lowest gap' is statistically distinguishable from the next-best baseline.
minor comments (2)
  1. [Abstract] Abstract: quantitative values for the modality gap (e.g., the actual distance or accuracy delta) and the precise paralinguistic task accuracies should be stated so the abstract can stand alone.
  2. [Figures / Tables] Figure captions and table headers should explicitly state the number of runs or seeds used for each reported score; this is a minor but necessary clarification for reproducibility.
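Comment 2's statistical request has a simple shape. A toy sketch of the per-seed reporting it asks for, with invented modality-gap values and a mean ± sample-standard-deviation summary; the systems and numbers are hypothetical:

```python
import statistics

def summarize_runs(gaps):
    """Mean and sample standard deviation across seeds, as comment 2 requests."""
    return statistics.mean(gaps), statistics.stdev(gaps)

# Hypothetical per-seed modality-gap values for two systems (5 seeds each).
textpro = [0.012, 0.015, 0.011, 0.014, 0.013]
baseline = [0.019, 0.016, 0.021, 0.018, 0.020]

m1, s1 = summarize_runs(textpro)
m2, s2 = summarize_runs(baseline)
# Non-overlapping mean ± std intervals suggest (informally; a significance
# test would be stronger) that a "lowest gap" claim is distinguishable
# from the next-best baseline.
assert m1 + s1 < m2 - s2
```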

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the evidential support and methodological clarity.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that semantic capabilities of the original TLM are preserved requires explicit reporting of performance on at least one standard text-only LLM benchmark (e.g., MMLU or a held-out text instruction-following set) before and after the prosody-augmented training; without these numbers the 'prosody-aware text LLM' framing remains under-supported.

    Authors: We agree that explicit before-and-after comparisons on standard text-only benchmarks would more robustly substantiate the claim of preserved semantic capabilities. In the revised manuscript we will report results on a held-out text instruction-following evaluation set (and MMLU if space allows) for both the original TLM checkpoint and the fine-tuned TextPro-SLM. This addition will directly address the concern and better support the 'prosody-aware text LLM' description. revision: yes

  2. Referee: [Method / Experiments] The modality-gap metric itself (presumably defined in §3 or §4) is load-bearing for the headline result; the paper should include the exact formula, the reference text embedding space, and error bars across multiple runs or seeds to allow readers to assess whether the reported 'lowest gap' is statistically distinguishable from the next-best baseline.

    Authors: We thank the referee for underscoring the need for full specification of this central metric. We will add the precise mathematical formula for the modality gap, explicitly identify the reference text embedding space, and report error bars (standard deviations across multiple random seeds) in the revised Experiments section. These changes will enable readers to evaluate whether the lowest-gap result is statistically distinguishable from baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

Full rationale

The paper advances an empirical architecture (TextPro-SLM) that fuses a WhisperPro encoder producing synchronized text tokens and prosody embeddings with an LLM backbone. All central claims—lowest modality gap at 3B/7B scales, preserved semantic capability, paralinguistic gains with ~1k hours of audio—are presented as outcomes of training and evaluation protocols rather than as derivations from equations or first-principles results. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the headline result to its own inputs appear in the provided text. The argument is therefore self-contained and falsifiable via the reported metrics and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or detailed invented entities beyond the high-level model names can be extracted. The approach implicitly relies on standard assumptions about LLM fine-tuning and speech encoding but does not state them explicitly.

invented entities (2)
  • WhisperPro no independent evidence
    purpose: Unified speech encoder that produces synchronized text tokens and prosody embeddings
    New component introduced to address input-side modality gap
  • TextPro-SLM no independent evidence
    purpose: SLM that preserves semantic capabilities of the original TLM while learning paralinguistic understanding
    Proposed model architecture combining WhisperPro with LLM backbone

pith-pipeline@v0.9.0 · 5513 in / 1274 out tokens · 65433 ms · 2026-05-11T00:44:04.669564+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

76 extracted references · 39 canonical work pages · 14 internal anchors

  1. [1]

    Recent advances in speech language models: A survey

    Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Steven Y Guo, and Irwin King. Recent advances in speech language models: A survey. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943–13970, 2025

  2. [2]

    Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024

  3. [3]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

  4. [4]

    Think before you talk: Enhancing meaningful dialogue generation in full-duplex speech language models with planning-inspired text guidance.arXiv preprint arXiv:2508.07375, 2025

    Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, and Irwin King. Think before you talk: Enhancing meaningful dialogue generation in full-duplex speech language models with planning-inspired text guidance.arXiv preprint arXiv:2508.07375, 2025

  5. [5]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  6. [6]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  7. [7]

    The llama 4 herd: Architecture, training, evaluation, and deployment notes.arXiv preprint arXiv:2601.11659, 2026

    Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Ab- hinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey, et al. The llama 4 herd: Architecture, training, evaluation, and deployment notes.arXiv preprint arXiv:2601.11659, 2026

  8. [8]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  9. [9]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  10. [10]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  11. [11]

    Closing the gap between text and speech under- standing in llms.arXiv preprint arXiv:2510.13632, 2025

    Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhoma- nenko, Navdeep Jaitly, and Zakaria Aldeneh. Closing the gap between text and speech under- standing in llms.arXiv preprint arXiv:2510.13632, 2025

  12. [12]

    Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

    Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

  13. [13]

    On generative spoken language modeling from raw audio.Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

    Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. On generative spoken language modeling from raw audio.Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

  14. [14]

    Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36, 2024

    Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36, 2024

  15. [15]

    Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux

    Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, et al. Spirit-lm: Interleaved spoken and written language model.arXiv preprint arXiv:2402.05755, 2024. 10

  16. [16]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore, December 2023. Asso...

  17. [17]

    Mini-omni: Language models can hear, talk while thinking in streaming,

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024

  18. [18]

    Vita-audio: Fast interleaved audio-text token generation for efficient large speech-language model

    Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, et al. Vita-audio: Fast interleaved audio-text token generation for efficient large speech-language model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  19. [19]

    Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis

    Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18617–18629, 2025

  20. [20]

    Minmo: A multimodal large language model for seamless voice interaction.CoRR, abs/2501.06282, 2025

    Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, et al. Minmo: A multimodal large language model for seamless voice interaction.arXiv preprint arXiv:2501.06282, 2025

  21. [21]

    V ocalnet: Speech llms with multi-token prediction for faster and high-quality generation

    Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, and Yu Wang. V ocalnet: Speech llms with multi-token prediction for faster and high-quality generation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19595–19612, 2025

  22. [22]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  23. [23]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025. URLhttps://arxiv.org/abs/2503.20215

  24. [24]

    Understanding the modality gap: An empirical study on the speech-text alignment mechanism of large speech language models

    Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, and Wei Zou. Understanding the modality gap: An empirical study on the speech-text alignment mechanism of large speech language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5187–5202, 2025

  25. [25]

    Soundwave: Less is more for speech-text alignment in llms

    Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, and Haizhou Li. Soundwave: Less is more for speech-text alignment in llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18718–18738, 2025

  26. [26]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  27. [27]

    W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021

  28. [28]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  29. [29]

    Kimi-Audio Technical Report

    Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report.arXiv preprint arXiv:2504.18425, 2025. 11

  30. [30]

    Anatomy of the modality gap: Dissecting the internal states of end-to-end speech llms.arXiv preprint arXiv:2603.01502, 2026

    Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, and Zhizheng Wu. Anatomy of the modality gap: Dissecting the internal states of end-to-end speech llms.arXiv preprint arXiv:2603.01502, 2026

  31. [31]

    Cross-modal knowledge distillation for speech large language models.arXiv preprint arXiv:2509.14930, 2025

    Enzhi Wang, Qicheng Li, Zhiyuan Tang, and Yuhang Jia. Cross-modal knowledge distillation for speech large language models.arXiv preprint arXiv:2509.14930, 2025

  32. [32]

    X-opd: Cross-modal on-policy distillation for capability alignment in speech llms.arXiv preprint arXiv:2603.24596, 2026

    Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, and Tao Jin. X-opd: Cross-modal on-policy distillation for capability alignment in speech llms.arXiv preprint arXiv:2603.24596, 2026

  33. [33]

    Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation.arXiv preprint arXiv:2601.16547, 2026

    Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, et al. Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation.arXiv preprint arXiv:2601.16547, 2026

  34. [34]

    Teaching audio models to reason: A unified framework for source-and layer-wise distillation.arXiv preprint arXiv:2509.18579, 2025

    Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, and Shilei Zhang. Teaching audio models to reason: A unified framework for source-and layer-wise distillation.arXiv preprint arXiv:2509.18579, 2025

  35. [35]

    Deepomni: Towards seamless and smart speech interaction with adaptive modality-specific moe.arXiv preprint arXiv:2506.21864, 2025

    Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Zuwei Long, Dong Yang, Ke Li, and Xing Sun. Deepomni: Towards seamless and smart speech interaction with adaptive modality-specific moe.arXiv preprint arXiv:2506.21864, 2025

  36. [36]

    A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

    Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137, 2025

  37. [37]

    Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models.arXiv preprint arXiv:2505.17496, 2025

    Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, and Hung-yi Lee. Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models.arXiv preprint arXiv:2505.17496, 2025

  38. [38]

    Recent advances of multimodal continual learning: A comprehensive survey.arXiv preprint arXiv:2410.05352, 2024

    Dianzhi Yu, Xinni Zhang, Yankai Chen, Aiwei Liu, Yifei Zhang, Philip S Yu, and Irwin King. Recent advances of multimodal continual learning: A comprehensive survey.arXiv preprint arXiv:2410.05352, 2024

  39. [39]

    Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

  40. [40]

    Hasrd: Hierarchical acoustic and semantic representation disentanglement.arXiv preprint arXiv:2506.00843, 2025

    Amir Hussein, Sameer Khurana, Gordon Wichern, Francois G Germain, and Jonathan Le Roux. Hasrd: Hierarchical acoustic and semantic representation disentanglement.arXiv preprint arXiv:2506.00843, 2025

  41. [41]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  42. [42]

    Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello

    Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, et al. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling.arXiv preprint arXiv:2408.16532, 2024

  43. [43]

    Librispeech: an asr corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015

  44. [44]

    Librispeech-pc: Benchmark for evaluation of punctuation and capitaliza- tion capabilities of end-to-end asr models

    Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, and Boris Ginsburg. Librispeech-pc: Benchmark for evaluation of punctuation and capitaliza- tion capabilities of end-to-end asr models. In2023 IEEE automatic speech recognition and understanding workshop (ASRU), pages 1–7. IEEE, 2023

  45. [45]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019. 12

  46. [46]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023

  47. [47]

    Scaling rich style-prompted text-to-speech datasets

    Anuj Diwan, Zhisheng Zheng, David Harwath, and Eunsol Choi. Scaling rich style-prompted text-to-speech datasets. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3639–3659, 2025

  48. [48]

    Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

  49. [49]

    Crowd-sourced emotional multimodal actors dataset (crema-d), 2025

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crowd-sourced emotional multimodal actors dataset (crema-d), 2025

  50. [50]

    Surrey audio-visual expressed emotion (savee) database

    Philip Jackson and SJUoSG Haq. Surrey audio-visual expressed emotion (savee) database. University of Surrey: Guildford, UK, 2014

  51. [51]

    University of Toronto, Psychology Department Toronto, ON, Canada, 2010

    Kate Dupuis and M Kathleen Pichora-Fuller.Toronto emotional speech set (TESS). University of Toronto, Psychology Department Toronto, ON, Canada, 2010

  52. [52]

    Emotional voice conversion: Theory, databases and esd.Speech Communication, 137:1–18, 2022

    Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional voice conversion: Theory, databases and esd.Speech Communication, 137:1–18, 2022

  53. [53]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. InProceedings of the twelfth language resources and evaluation conference, pages 4218–4222, 2020

  54. [54]

    Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

    Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

  55. [55]

    Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018

  56. [56]

    Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. VoxCeleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60:101027, 2020

  57. [57]

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019

  58. [58]

    Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. VoiceBench: Benchmarking LLM-based voice assistants. Transactions of the Association for Computational Linguistics, 14:378–398, 2026

  59. [59]

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  60. [60]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  61. [61]

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. AIR-Bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

  62. [62]

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759, 2024

  63. [63]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  64. [64]

    William Held, Yanzhe Zhang, Minzhi Li, Weiyan Shi, Michael J Ryan, and Diyi Yang. Distilling an end-to-end voice assistant without instruction training data. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7876–7891, 2025

  65. [65]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  66. [66]

    Qwen Team. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115

  67. [67]

    Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, and Irwin King. VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16735–16753, 2025

  68. [68]

    Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Riviere, Abdelrahman Mohamed, Emmanuel Dupoux, et al. Text-free prosody-aware generative spoken language modeling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8666–8681, 2022

  69. [69]

    Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, and Xipeng Qiu. SpeechGPT-Gen: Scaling chain-of-information speech generation. arXiv preprint arXiv:2401.13527, 2024

  70. [70]

    Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024

  71. [71]

    Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, and Yang Zhang. ProsodyLM: Uncovering the emerging prosody processing capabilities in speech language models. arXiv preprint arXiv:2507.20091, 2025

A Discussions & Limitations

Although we have shown promising results for TextPro-SLM, the current study has several limita...

Knowledge Distillation Loss Type: Prior studies that address the modality gap problem typically use Kullback-Leibler (KL) divergence loss for knowledge distillation [11, 31]. In particular, SALAD [11] finds that KL loss achieves significantly better modality alignment results. Therefore, we investigate whether KL loss or cross entropy loss provide...
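The contrast between the two candidate objectives can be sketched in plain Python for a single vocabulary position (an illustrative sketch, not the paper's implementation): KL divergence matches the teacher's full soft distribution, while cross entropy only matches the teacher's hard top-1 label.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(student_logits, teacher_logits):
    """KL(teacher || student): matches the teacher's entire soft distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ce_distill_loss(student_logits, teacher_logits):
    """Cross entropy to the teacher's argmax: only the hard label survives."""
    q = softmax(student_logits)
    hard = teacher_logits.index(max(teacher_logits))
    return -math.log(q[hard])
```

One difference this makes visible: when the student exactly matches the teacher, the KL term is zero, whereas the cross-entropy term to a hard label remains positive unless the student's distribution collapses to one-hot, so the soft target preserves strictly more of the teacher's output geometry.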

Knowledge Distillation Text Token Source: Because WhisperPro internally performs ASR to produce text tokens, an important design question is which text source should be used during distillation. More specifically, we ask whether the student should be trained with ground-truth text, with ASR-transcribed text, or with a mixture in which the student input and...

GTQ_GTA: The student input uses the original ground-truth text, and the teacher distribution is also computed from the original ground-truth text. This is the cleanest setting, and it corresponds to distillation under an idealized assumption of error-free transcription

ASRQ_ASRA: The student input uses the text transcribed by WhisperPro from the speech-form data, and the teacher distribution is also conditioned on this transcribed text. This setting matches inference-time conditions most closely, because the student is trained on the same type of text tokens it will receive at test time. However, the distillation target ...

ASRQ_GTA: The student input uses the text transcribed by WhisperPro, while the teacher distribution is computed from the original ground-truth text. This setting exposes the student to realistic ASR-imperfect inputs, but still provides a clean teacher target. Intuitively, it tests whether the model can learn to recover from typical ASR errors by mapping no...
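The three pairings reduce to a small selection table over two text sources. A minimal sketch, assuming the naming convention that the prefix names the student (query) source and the suffix the teacher (answer) source; the function name and signature are ours, not the paper's:

```python
def build_distillation_pair(setting, gt_text, asr_text):
    """Return (student_input_text, teacher_input_text) for one ablation setting.

    gt_text  -- the original ground-truth transcript
    asr_text -- the transcript WhisperPro produces from the speech-form data
    """
    table = {
        "GTQ_GTA":   (gt_text, gt_text),    # idealized: error-free transcription
        "ASRQ_ASRA": (asr_text, asr_text),  # matches inference-time conditions
        "ASRQ_GTA":  (asr_text, gt_text),   # noisy student input, clean teacher target
    }
    return table[setting]
```

For example, with a hypothetical ASR slip ("their" misrecognized as "there"), ASRQ_GTA would feed the erroneous transcript to the student while the teacher distribution is still computed from the clean text, which is exactly the error-recovery pressure the paragraph above describes.
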