pith. machine review for the scientific record.

arxiv: 2605.05927 · v2 · submitted 2026-05-07 · 💻 cs.CL · cs.SD · eess.AS


Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

Daxin Tan, Irwin King, Qiyong Zheng, Wenqian Cui, Xiao-Hui Li


Pith reviewed 2026-05-11 00:44 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords speech large language models · modality gap · prosody embeddings · paralinguistic understanding · input-side processing · WhisperPro · text LLM adaptation

The pith

Speech LLMs can close their modality gap by feeding prosody embeddings alongside text tokens from the input side.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the remaining gap between speech and text large language models comes from how speech reaches the model rather than from how the model generates answers. TextPro-SLM addresses this by running audio through WhisperPro to produce ordinary text tokens together with separate prosody embeddings, so the LLM backbone sees spoken input in a form closer to prosody-aware text. The backbone is trained to keep its original semantic performance while learning to use the added prosody signals for paralinguistic understanding. Experiments at 3B and 7B scales show this input-side change produces the smallest modality gap among comparable models and good results on tasks that require tone or emotion awareness, all after only about one thousand hours of audio training. If the claim holds, strong text models could be turned into capable speech models without massive new speech corpora.

Core claim

TextPro-SLM is built by combining WhisperPro, a unified speech encoder that outputs synchronized text tokens and prosody embeddings, with an LLM backbone trained to retain the semantic capabilities of its original text checkpoint while acquiring paralinguistic understanding. This input-side design yields the lowest modality gap observed among leading SLMs at both 3B and 7B scales and competitive performance on paralinguistic tasks, using only roughly 1,000 hours of audio for the LLM training stage.

What carries the argument

WhisperPro, the unified speech encoder that produces synchronized text tokens and prosody embeddings from audio so that spoken input resembles prosody-aware text for the LLM backbone.
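The load-bearing component is easier to assess with a concrete picture. A minimal sketch of the input-side design, assuming (the abstract does not specify the fusion operator) that the synchronized prosody embeddings are linearly projected into the model dimension and added position-wise to the text-token embeddings; every name and dimension here is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def prosody_aware_inputs(token_ids, tok_emb, prosody, proj):
    """Assumed additive fusion: look up text-token embeddings, project the
    time-aligned prosody embeddings into the model dimension, add them."""
    text_vecs = tok_emb[token_ids]          # (seq, d_model)
    prosody_vecs = prosody @ proj           # (seq, d_model)
    return text_vecs + prosody_vecs

d_model, d_prosody, vocab = 8, 4, 100       # toy sizes, not the paper's
tok_emb = rng.normal(size=(vocab, d_model))
proj = rng.normal(size=(d_prosody, d_model))
token_ids = np.array([5, 17, 42])           # "text tokens" from the encoder
prosody = rng.normal(size=(3, d_prosody))   # synchronized prosody embeddings

x = prosody_aware_inputs(token_ids, tok_emb, prosody, proj)
assert x.shape == (3, d_model)  # one fused vector per token, ready for the backbone
```

Additive fusion of this kind leaves the text-embedding path untouched, which is one way a backbone could retain its original semantic behavior when the prosody signal is uninformative.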

If this is right

  • TextPro-SLM records the lowest modality gap among leading SLMs at both 3B and 7B scales.
  • The model maintains strong performance on tasks that require paralinguistic understanding.
  • The improvements are obtained after training the LLM component on only roughly 1,000 hours of audio.
  • Addressing the input side reduces the gap more effectively than prior methods that adjusted only the output side.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any new text LLM could be adapted into a speech model with comparatively small amounts of additional audio data.
  • The explicit separation of prosody signals might support new tasks such as style-controlled generation or emotion-aware responses.
  • The same input representation could improve robustness to accents or speaking styles that differ from the training distribution.
  • Combining the input-side approach with existing output-side techniques could narrow the remaining gap still further.

Load-bearing premise

That the prosody embeddings from WhisperPro can be added to the LLM input without degrading its semantic performance, and that this addition is what produces the measured reduction in modality gap.

What would settle it

An ablation in which TextPro-SLM without the prosody embeddings shows no increase in modality gap and no drop on paralinguistic tasks would indicate that the input-side prosody mechanism is not responsible for the reported gains.
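The settling experiment reduces to a few lines of arithmetic. A toy sketch, assuming one plausible (not the paper's stated) definition of the modality gap as the mean per-item score drop when the same items arrive as speech rather than text; all scores below are invented:

```python
def modality_gap(text_scores, speech_scores):
    """Assumed gap definition: mean per-item drop from text to speech input."""
    return sum(t - s for t, s in zip(text_scores, speech_scores)) / len(text_scores)

# Hypothetical per-item accuracies for the two ablation arms.
text = [0.82, 0.90, 0.76, 0.88]
with_prosody = [0.80, 0.88, 0.75, 0.86]
without_prosody = [0.80, 0.89, 0.74, 0.87]

gap_with = modality_gap(text, with_prosody)        # 0.0175
gap_without = modality_gap(text, without_prosody)  # 0.0150
# If removing the prosody embeddings leaves the gap (and the paralinguistic
# scores) essentially unchanged, the input-side prosody mechanism cannot be
# credited with the reported gains; a clearly widened gap would support it.
```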

Figures

Figures reproduced from arXiv: 2605.05927 by Daxin Tan, Irwin King, Qiyong Zheng, Wenqian Cui, Xiao-Hui Li.

Figure 1. Left: Comparison of the architectural designs used by prior SLMs and our approach. Right: […]
Figure 2. Model architecture of WhisperPro and TextPro-SLM.
Figure 3. Design choice investigation results for WhisperPro.
Original abstract

Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TextPro-SLM, an SLM constructed from TLM checkpoints that reduces the modality gap on the input side. It uses WhisperPro, a unified speech encoder producing synchronized text tokens and prosody embeddings, paired with an LLM backbone trained to retain its original semantic capabilities while acquiring paralinguistic understanding. At both 3B and 7B scales, TextPro-SLM reports the lowest modality gap among compared SLMs and competitive or superior paralinguistic task performance, obtained with only ~1,000 hours of LLM training audio.

Significance. If the reported reductions in modality gap and preservation of semantic performance hold under the described evaluation protocol, the work demonstrates that input-side prosody integration can be more data-efficient than output-side alignment strategies. The explicit scaling to 3B/7B models and the emphasis on ~1k-hour training budgets provide a concrete, falsifiable benchmark for future input-side modality-bridging methods in speech-language modeling.

major comments (2)
  1. [Experiments] Experiments section: the central claim that semantic capabilities of the original TLM are preserved requires explicit reporting of performance on at least one standard text-only LLM benchmark (e.g., MMLU or a held-out text instruction-following set) before and after the prosody-augmented training; without these numbers the 'prosody-aware text LLM' framing remains under-supported.
  2. [Method / Experiments] The modality-gap metric itself (presumably defined in §3 or §4) is load-bearing for the headline result; the paper should include the exact formula, the reference text embedding space, and error bars across multiple runs or seeds to allow readers to assess whether the reported 'lowest gap' is statistically distinguishable from the next-best baseline.
minor comments (2)
  1. [Abstract] Abstract: quantitative values for the modality gap (e.g., the actual distance or accuracy delta) and the precise paralinguistic task accuracies should be stated so the abstract can stand alone.
  2. [Figures / Tables] Figure captions and table headers should explicitly state the number of runs or seeds used for each reported score; this is a minor but necessary clarification for reproducibility.
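Comment 2's statistical request has a simple shape. A toy sketch of the per-seed reporting it asks for, with invented modality-gap values and a mean ± sample-standard-deviation summary; the systems and numbers are hypothetical:

```python
import statistics

def summarize_runs(gaps):
    """Mean and sample standard deviation across seeds, as comment 2 requests."""
    return statistics.mean(gaps), statistics.stdev(gaps)

# Hypothetical per-seed modality-gap values for two systems (5 seeds each).
textpro = [0.012, 0.015, 0.011, 0.014, 0.013]
baseline = [0.019, 0.016, 0.021, 0.018, 0.020]

m1, s1 = summarize_runs(textpro)
m2, s2 = summarize_runs(baseline)
# Non-overlapping mean ± std intervals suggest (informally; a significance
# test would be stronger) that a "lowest gap" claim is distinguishable
# from the next-best baseline.
assert m1 + s1 < m2 - s2
```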

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the evidential support and methodological clarity.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that semantic capabilities of the original TLM are preserved requires explicit reporting of performance on at least one standard text-only LLM benchmark (e.g., MMLU or a held-out text instruction-following set) before and after the prosody-augmented training; without these numbers the 'prosody-aware text LLM' framing remains under-supported.

    Authors: We agree that explicit before-and-after comparisons on standard text-only benchmarks would more robustly substantiate the claim of preserved semantic capabilities. In the revised manuscript we will report results on a held-out text instruction-following evaluation set (and MMLU if space allows) for both the original TLM checkpoint and the fine-tuned TextPro-SLM. This addition will directly address the concern and better support the 'prosody-aware text LLM' description. revision: yes

  2. Referee: [Method / Experiments] The modality-gap metric itself (presumably defined in §3 or §4) is load-bearing for the headline result; the paper should include the exact formula, the reference text embedding space, and error bars across multiple runs or seeds to allow readers to assess whether the reported 'lowest gap' is statistically distinguishable from the next-best baseline.

    Authors: We thank the referee for underscoring the need for full specification of this central metric. We will add the precise mathematical formula for the modality gap, explicitly identify the reference text embedding space, and report error bars (standard deviations across multiple random seeds) in the revised Experiments section. These changes will enable readers to evaluate whether the lowest-gap result is statistically distinguishable from baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

Full rationale

The paper advances an empirical architecture (TextPro-SLM) that fuses a WhisperPro encoder producing synchronized text tokens and prosody embeddings with an LLM backbone. All central claims—lowest modality gap at 3B/7B scales, preserved semantic capability, paralinguistic gains with ~1k hours of audio—are presented as outcomes of training and evaluation protocols rather than as derivations from equations or first-principles results. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the headline result to its own inputs appear in the provided text. The argument is therefore self-contained and falsifiable via the reported metrics and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or detailed invented entities beyond the high-level model names can be extracted. The approach implicitly relies on standard assumptions about LLM fine-tuning and speech encoding but does not state them explicitly.

invented entities (2)
  • WhisperPro no independent evidence
    purpose: Unified speech encoder that produces synchronized text tokens and prosody embeddings
    New component introduced to address input-side modality gap
  • TextPro-SLM no independent evidence
    purpose: SLM that preserves semantic capabilities of the original TLM while learning paralinguistic understanding
    Proposed model architecture combining WhisperPro with LLM backbone

pith-pipeline@v0.9.0 · 5513 in / 1274 out tokens · 65433 ms · 2026-05-11T00:44:04.669564+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

76 extracted references · 39 canonical work pages · 14 internal anchors

  1. [1]

    Recent advances in speech language models: A survey

    Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Steven Y Guo, and Irwin King. Recent advances in speech language models: A survey. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943–13970, 2025

  2. [2]

    Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024

  3. [3]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

  4. [4]

    Think before you talk: Enhancing meaningful dialogue generation in full-duplex speech language models with planning-inspired text guidance.arXiv preprint arXiv:2508.07375, 2025

    Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, and Irwin King. Think before you talk: Enhancing meaningful dialogue generation in full-duplex speech language models with planning-inspired text guidance.arXiv preprint arXiv:2508.07375, 2025

  5. [5]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  6. [6]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  7. [7]

    The llama 4 herd: Architecture, training, evaluation, and deployment notes.arXiv preprint arXiv:2601.11659, 2026

    Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Ab- hinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey, et al. The llama 4 herd: Architecture, training, evaluation, and deployment notes.arXiv preprint arXiv:2601.11659, 2026

  8. [8]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  9. [9]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  10. [10]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  11. [11]

    Closing the gap between text and speech under- standing in llms.arXiv preprint arXiv:2510.13632, 2025

    Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhoma- nenko, Navdeep Jaitly, and Zakaria Aldeneh. Closing the gap between text and speech under- standing in llms.arXiv preprint arXiv:2510.13632, 2025

  12. [12]

    Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

    Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

  13. [13]

    On generative spoken language modeling from raw audio.Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

    Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. On generative spoken language modeling from raw audio.Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

  14. [14]

    Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36, 2024

    Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36, 2024

  15. [15]

    Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux

    Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, et al. Spirit-lm: Interleaved spoken and written language model.arXiv preprint arXiv:2402.05755, 2024. 10

  16. [16]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore, December 2023. Asso...

  17. [17]

    Mini-omni: Language models can hear, talk while thinking in streaming,

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024

  18. [18]

    Vita-audio: Fast interleaved audio-text token generation for efficient large speech-language model

    Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, et al. Vita-audio: Fast interleaved audio-text token generation for efficient large speech-language model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  19. [19]

    Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis

    Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18617–18629, 2025

  20. [20]

    Minmo: A multimodal large language model for seamless voice interaction.CoRR, abs/2501.06282, 2025

    Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, et al. Minmo: A multimodal large language model for seamless voice interaction.arXiv preprint arXiv:2501.06282, 2025

  21. [21]

    V ocalnet: Speech llms with multi-token prediction for faster and high-quality generation

    Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, and Yu Wang. V ocalnet: Speech llms with multi-token prediction for faster and high-quality generation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19595–19612, 2025

  22. [22]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  23. [23]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025. URLhttps://arxiv.org/abs/2503.20215

  24. [24]

    Understanding the modality gap: An empirical study on the speech-text alignment mechanism of large speech language models

    Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, and Wei Zou. Understanding the modality gap: An empirical study on the speech-text alignment mechanism of large speech language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5187–5202, 2025

  25. [25]

    Soundwave: Less is more for speech-text alignment in llms

    Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, and Haizhou Li. Soundwave: Less is more for speech-text alignment in llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18718–18738, 2025

  26. [26]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  27. [27]

    W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021

  28. [28]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  29. [29]

    Kimi-Audio Technical Report

    Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report.arXiv preprint arXiv:2504.18425, 2025. 11

  30. [30]

    Anatomy of the modality gap: Dissecting the internal states of end-to-end speech llms.arXiv preprint arXiv:2603.01502, 2026

    Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, and Zhizheng Wu. Anatomy of the modality gap: Dissecting the internal states of end-to-end speech llms.arXiv preprint arXiv:2603.01502, 2026

  31. [31]

    Cross-modal knowledge distillation for speech large language models.arXiv preprint arXiv:2509.14930, 2025

    Enzhi Wang, Qicheng Li, Zhiyuan Tang, and Yuhang Jia. Cross-modal knowledge distillation for speech large language models.arXiv preprint arXiv:2509.14930, 2025

  32. [32]

    X-opd: Cross-modal on-policy distillation for capability alignment in speech llms.arXiv preprint arXiv:2603.24596, 2026

    Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, and Tao Jin. X-opd: Cross-modal on-policy distillation for capability alignment in speech llms.arXiv preprint arXiv:2603.24596, 2026

  33. [33]

    Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation.arXiv preprint arXiv:2601.16547, 2026

    Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, et al. Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation.arXiv preprint arXiv:2601.16547, 2026

  34. [34]

    Teaching audio models to reason: A unified framework for source-and layer-wise distillation.arXiv preprint arXiv:2509.18579, 2025

    Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, and Shilei Zhang. Teaching audio models to reason: A unified framework for source-and layer-wise distillation.arXiv preprint arXiv:2509.18579, 2025

  35. [35]

    Deepomni: Towards seamless and smart speech interaction with adaptive modality-specific moe.arXiv preprint arXiv:2506.21864, 2025

    Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Zuwei Long, Dong Yang, Ke Li, and Xing Sun. Deepomni: Towards seamless and smart speech interaction with adaptive modality-specific moe.arXiv preprint arXiv:2506.21864, 2025

  36. [36]

    A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

    Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137, 2025

  37. [37]

    Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models.arXiv preprint arXiv:2505.17496, 2025

    Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, and Hung-yi Lee. Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models.arXiv preprint arXiv:2505.17496, 2025

  38. [38]

    Recent advances of multimodal continual learning: A comprehensive survey.arXiv preprint arXiv:2410.05352, 2024

    Dianzhi Yu, Xinni Zhang, Yankai Chen, Aiwei Liu, Yifei Zhang, Philip S Yu, and Irwin King. Recent advances of multimodal continual learning: A comprehensive survey.arXiv preprint arXiv:2410.05352, 2024

  39. [39]

    Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

  40. [40]

    Hasrd: Hierarchical acoustic and semantic representation disentanglement.arXiv preprint arXiv:2506.00843, 2025

    Amir Hussein, Sameer Khurana, Gordon Wichern, Francois G Germain, and Jonathan Le Roux. Hasrd: Hierarchical acoustic and semantic representation disentanglement.arXiv preprint arXiv:2506.00843, 2025

  41. [41]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  42. [42]

    Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello

    Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, et al. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling.arXiv preprint arXiv:2408.16532, 2024

  43. [43]

    Librispeech: an asr corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015

  44. [44]

    Librispeech-pc: Benchmark for evaluation of punctuation and capitaliza- tion capabilities of end-to-end asr models

    Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, and Boris Ginsburg. Librispeech-pc: Benchmark for evaluation of punctuation and capitaliza- tion capabilities of end-to-end asr models. In2023 IEEE automatic speech recognition and understanding workshop (ASRU), pages 1–7. IEEE, 2023

  45. [45]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019. 12

  46. [46]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023

  47. [47]

    Scaling rich style-prompted text-to-speech datasets

    Anuj Diwan, Zhisheng Zheng, David Harwath, and Eunsol Choi. Scaling rich style-prompted text-to-speech datasets. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3639–3659, 2025

  48. [48]

    Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

  49. [49]

    Crowd-sourced emotional multimodal actors dataset (crema-d), 2025

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crowd-sourced emotional multimodal actors dataset (crema-d), 2025

  50. [50]

    Surrey audio-visual expressed emotion (savee) database

    Philip Jackson and SJUoSG Haq. Surrey audio-visual expressed emotion (savee) database. University of Surrey: Guildford, UK, 2014

  51. [51]

    University of Toronto, Psychology Department Toronto, ON, Canada, 2010

    Kate Dupuis and M Kathleen Pichora-Fuller.Toronto emotional speech set (TESS). University of Toronto, Psychology Department Toronto, ON, Canada, 2010

  52. [52]

    Emotional voice conversion: Theory, databases and esd.Speech Communication, 137:1–18, 2022

    Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional voice conversion: Theory, databases and esd.Speech Communication, 137:1–18, 2022

  53. [53]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. InProceedings of the twelfth language resources and evaluation conference, pages 4218–4222, 2020

  54. [54]

    Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

    Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

  55. [55]

    Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018

  56. [56]

    Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. VoxCeleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60:101027, 2020

  57. [57]

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019

  58. [58]

    Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. VoiceBench: Benchmarking LLM-based voice assistants. Transactions of the Association for Computational Linguistics, 14:378–398, 2026

  59. [59]

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  60. [60]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  61. [61]

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. AIR-Bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

  62. [62]

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759, 2024

  63. [63]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  64. [64]

    William Held, Yanzhe Zhang, Minzhi Li, Weiyan Shi, Michael J Ryan, and Diyi Yang. Distilling an end-to-end voice assistant without instruction training data. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7876–7891, 2025

  65. [65]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  66. [66]

    Qwen Team. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115

  67. [67]

    Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, and Irwin King. VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16735–16753, 2025

  68. [68]

    Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Riviere, Abdelrahman Mohamed, Emmanuel Dupoux, et al. Text-free prosody-aware generative spoken language modeling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8666–8681, 2022

  69. [69]

    Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, and Xipeng Qiu. SpeechGPT-Gen: Scaling chain-of-information speech generation. arXiv preprint arXiv:2401.13527, 2024

  70. [70]

    Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024

  71. [71]

    Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, and Yang Zhang. ProsodyLM: Uncovering the emerging prosody processing capabilities in speech language models. arXiv preprint arXiv:2507.20091, 2025

A Discussions & Limitations

Although we have shown promising results for TextPro-SLM, the current study has several limita...

Knowledge Distillation Loss Type: Prior studies that address the modality gap problem typically use Kullback-Leibler (KL) divergence loss for knowledge distillation [11, 31]. In particular, SALAD [11] finds that KL loss achieves significantly better modality alignment results. Therefore, we investigate whether KL loss or cross entropy loss provide...
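The contrast between the two candidate objectives can be sketched in plain Python for a single vocabulary position (an illustrative sketch, not the paper's implementation): KL divergence matches the teacher's full soft distribution, while cross entropy only matches the teacher's hard top-1 label.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(student_logits, teacher_logits):
    """KL(teacher || student): matches the teacher's entire soft distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ce_distill_loss(student_logits, teacher_logits):
    """Cross entropy to the teacher's argmax: only the hard label survives."""
    q = softmax(student_logits)
    hard = teacher_logits.index(max(teacher_logits))
    return -math.log(q[hard])
```

One difference this makes visible: when the student exactly matches the teacher, the KL term is zero, whereas the cross-entropy term to a hard label remains positive unless the student's distribution collapses to one-hot, so the soft target preserves strictly more of the teacher's output geometry.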

Knowledge Distillation Text Token Source: Because WhisperPro internally performs ASR to produce text tokens, an important design question is which text source should be used during distillation. More specifically, we ask whether the student should be trained with ground-truth text, with ASR-transcribed text, or with a mixture in which the student input and...

GTQ_GTA: The student input uses the original ground-truth text, and the teacher distribution is also computed from the original ground-truth text. This is the cleanest setting, and it corresponds to distillation under an idealized assumption of error-free transcription

ASRQ_ASRA: The student input uses the text transcribed by WhisperPro from the speech-form data, and the teacher distribution is also conditioned on this transcribed text. This setting matches inference-time conditions most closely, because the student is trained on the same type of text tokens it will receive at test time. However, the distillation target ...

ASRQ_GTA: The student input uses the text transcribed by WhisperPro, while the teacher distribution is computed from the original ground-truth text. This setting exposes the student to realistic ASR-imperfect inputs, but still provides a clean teacher target. Intuitively, it tests whether the model can learn to recover from typical ASR errors by mapping no...
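The three pairings reduce to a small selection table over two text sources. A minimal sketch, assuming the naming convention that the prefix names the student (query) source and the suffix the teacher (answer) source; the function name and signature are ours, not the paper's:

```python
def build_distillation_pair(setting, gt_text, asr_text):
    """Return (student_input_text, teacher_input_text) for one ablation setting.

    gt_text  -- the original ground-truth transcript
    asr_text -- the transcript WhisperPro produces from the speech-form data
    """
    table = {
        "GTQ_GTA":   (gt_text, gt_text),    # idealized: error-free transcription
        "ASRQ_ASRA": (asr_text, asr_text),  # matches inference-time conditions
        "ASRQ_GTA":  (asr_text, gt_text),   # noisy student input, clean teacher target
    }
    return table[setting]
```

For example, with a hypothetical ASR slip ("their" misrecognized as "there"), ASRQ_GTA would feed the erroneous transcript to the student while the teacher distribution is still computed from the clean text, which is exactly the error-recovery pressure the paragraph above describes.
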