MOSS-Audio Technical Report
Pith reviewed 2026-06-28 13:03 UTC · model grok-4.3
The pith
MOSS-Audio couples an audio encoder to a language model with cross-layer injection and explicit time markers to support captioning, transcription, and reasoning over speech, sounds, and music.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR by coupling a dedicated audio encoder with a modality adapter and a large language model, incorporating DeepStack cross-layer feature injection and time markers, and using an event-preserving audio annotation pipeline for pretraining and SFT data construction.
What carries the argument
DeepStack cross-layer feature injection together with inserted time markers, which supplies the decoder with acoustic features from multiple encoder depths and explicit temporal position cues.
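As a rough illustration of what cross-layer injection means, the sketch below residually adds per-position encoder features to the audio positions of one decoder layer. It follows the general DeepStack idea only. The function name, the layer-to-feature mapping, and the toy values are assumptions for illustration, not details taken from the report.

```python
from typing import Dict, List

Vector = List[float]


def inject_cross_layer(
    hidden: List[Vector],
    audio_positions: List[int],
    layer_features: Dict[int, List[Vector]],
    decoder_layer: int,
) -> List[Vector]:
    """Residually add encoder features to the audio positions of one decoder layer.

    `layer_features` maps a decoder layer index to already-projected encoder
    features (one vector per audio position). The mapping is hypothetical.
    """
    out = [row[:] for row in hidden]  # copy so the caller's states are untouched
    feats = layer_features.get(decoder_layer)
    if feats is None:
        return out  # this decoder layer receives no injection
    if len(feats) != len(audio_positions):
        raise ValueError("one feature vector is required per audio position")
    for pos, feat in zip(audio_positions, feats):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out


# Toy sequence: [text, audio, audio, text] with hidden size 2.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
audio_positions = [1, 2]
# Assumed mapping: decoder layer 1 receives features from some encoder depth.
layer_features = {1: [[0.5, 0.5], [0.25, 0.25]]}

injected = inject_cross_layer(hidden, audio_positions, layer_features, decoder_layer=1)
untouched = inject_cross_layer(hidden, audio_positions, layer_features, decoder_layer=2)
```

Only the audio positions change, and only in layers that have an entry in the mapping. Text positions pass through unmodified.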
If this is right
- The model supports time-aware question answering and audio-grounded reasoning after multi-stage post-training.
- Both 4B and 8B parameter versions are released in Instruct and Thinking configurations.
- Intermediate branch-specific captions are retained to build task-oriented supervised fine-tuning data.
- Time-aware objectives during pretraining enable temporal grounding in the generated outputs.
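The caption-retention point can be pictured with a small sketch: cut the audio at event boundaries, run one annotator per branch, merge the results into a unified caption, and keep the per-branch captions for later SFT construction. Every name here (`Segment`, `segment_at_boundaries`, `annotate`) and the stub annotators are hypothetical placeholders. The report's actual boundary detector, annotation models, and merge step are not specified in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Segment:
    start_s: float
    end_s: float
    branch_captions: Dict[str, str] = field(default_factory=dict)
    unified_caption: str = ""


def segment_at_boundaries(duration_s: float, boundaries_s: List[float]) -> List[Segment]:
    """Cut [0, duration] at the given event boundaries (assumed to come from a detector)."""
    cuts = sorted({b for b in boundaries_s if 0.0 < b < duration_s})
    edges = [0.0, *cuts, duration_s]
    return [Segment(a, b) for a, b in zip(edges, edges[1:])]


def annotate(segment: Segment, annotators: Dict[str, Callable[[Segment], str]]) -> Segment:
    """Run each branch annotator, keep its caption, then merge into one caption."""
    for branch, fn in annotators.items():
        caption = fn(segment)
        if caption:
            segment.branch_captions[branch] = caption
    # Placeholder merge: simple concatenation in branch order.
    segment.unified_caption = " ".join(segment.branch_captions.values())
    return segment


# Stub annotators standing in for the speech, music, and general-audio branches.
annotators = {
    "speech": lambda seg: "A speaker talks." if seg.start_s < 4.0 else "",
    "music": lambda seg: "Music plays." if seg.start_s >= 4.0 else "",
    "general": lambda seg: "Indoor ambience.",
}

segments = [annotate(s, annotators) for s in segment_at_boundaries(10.0, [4.0])]
# Retained branch captions, usable for task-oriented SFT data.
sft_pool = [
    (branch, seg.start_s, seg.end_s, caption)
    for seg in segments
    for branch, caption in seg.branch_captions.items()
]
```

Keeping `branch_captions` alongside `unified_caption` is the design choice that matters here: the merged caption serves pretraining, while the per-branch entries stay addressable by task.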
Where Pith is reading between the lines
- Voice-agent systems could treat MOSS-Audio as a reusable audio-understanding base rather than training separate models per task.
- The retained branch-specific captions may allow targeted fine-tuning for speech-only or music-only applications without full retraining.
- Similar cross-layer injection and marker techniques could be tested on other encoder-decoder pairs beyond the current audio setup.
- Scaling the pretraining data volume while keeping the same annotation pipeline would test whether the reported gains persist at larger sizes.
Load-bearing premise
The combination of DeepStack injection, time markers, and the event-preserving annotation pipeline is responsible for the measured performance gains.
What would settle it
An ablation that removes DeepStack, time markers, or the event-boundary segmentation step and shows no drop in scores on the reported audio tasks.
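One way to lay out such an ablation is to enumerate on/off settings for the three components and flag any component whose removal does not lower the score. The sketch below is a hypothetical harness: the component names, the tolerance rule, and the higher-is-better convention are assumptions, and no scores from the report are used.

```python
from itertools import product
from typing import Dict, List

COMPONENTS = ("deepstack", "time_markers", "event_segmentation")


def ablation_grid() -> List[Dict[str, bool]]:
    """All on/off combinations of the three components (2**3 configurations)."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((True, False), repeat=len(COMPONENTS))
    ]


def single_removals(grid: List[Dict[str, bool]]) -> List[Dict[str, bool]]:
    """Configurations that disable exactly one component."""
    return [cfg for cfg in grid if list(cfg.values()).count(False) == 1]


def components_without_drop(
    full_score: float, ablated_scores: Dict[str, float], tolerance: float
) -> List[str]:
    """Components whose removal costs at most `tolerance` points.

    Assumes higher scores are better; for error rates such as WER the
    comparison would need to be reversed.
    """
    return [
        name
        for name, score in ablated_scores.items()
        if full_score - score <= tolerance
    ]


grid = ablation_grid()
leave_one_out = single_removals(grid)
```

Any component returned by `components_without_drop` would count against the load-bearing premise for that component, provided scale, data volume, and the base LLM are held fixed across runs.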
Original abstract
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
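The 12.5 Hz figure implies one encoder frame every 80 ms, so a 10-second clip yields 125 frames. The sketch below works through that arithmetic and shows one possible way to interleave timestamp markers with audio tokens. Only the frame rate comes from the abstract. The marker format, the 2-second interval, and the token names are assumptions for illustration.

```python
from typing import List

FRAME_RATE_HZ = 12.5  # stated in the abstract; one frame per 80 ms


def num_frames(duration_s: float) -> int:
    """Number of whole encoder frames covering the clip."""
    return int(duration_s * FRAME_RATE_HZ)


def insert_time_markers(n_frames: int, interval_s: float = 2.0) -> List[str]:
    """Interleave a timestamp marker before every `interval_s` worth of frames.

    The marker string and the interval are hypothetical choices.
    """
    frames_per_marker = int(interval_s * FRAME_RATE_HZ)
    if frames_per_marker < 1:
        raise ValueError("interval_s must span at least one frame")
    stream: List[str] = []
    for i in range(n_frames):
        if i % frames_per_marker == 0:
            stream.append(f"<|t={i / FRAME_RATE_HZ:.1f}s|>")
        stream.append(f"<audio_{i}>")
    return stream


frames = num_frames(10.0)            # 125 frames for a 10-second clip
stream = insert_time_markers(frames)  # markers at 0, 2, 4, 6, and 8 seconds
```

With these settings the stream holds 125 audio tokens plus 5 markers, so the overhead of explicit temporal cues stays small relative to the audio sequence itself.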
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MOSS-Audio, a unified audio-language model coupling a 12.5 Hz audio encoder, modality adapter, and LLM decoder. It highlights two central design choices—DeepStack cross-layer feature injection and explicit time markers—plus an event-preserving annotation pipeline that segments audio at event boundaries and produces branch-specific captions for pretraining and SFT. The model is pretrained with time-aware objectives and post-trained for instruction following; 4B and 8B Instruct and Thinking variants are released. The abstract asserts that the system achieves strong performance on general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a foundation for voice agents.
Significance. If the performance claims and the causal contribution of the three highlighted design choices were substantiated by controlled benchmarks and ablations, the work would supply a concrete, temporally grounded audio-language model that could serve as a reusable backbone for downstream voice agents. The explicit retention of intermediate branch-specific captions for SFT data construction is a practical strength that could be reused by others.
Major comments (2)
- [Abstract] The central claim that DeepStack, time markers, and the event-preserving pipeline produce the reported performance gains is unsupported by any quantitative evidence. No benchmarks, baseline comparisons, ablation tables, or error bars are supplied to isolate the contribution of these components versus scale, data volume, or the base LLM, rendering the positioning as a 'promising understanding foundation' unevaluable.
- [Abstract, and implied results sections] The statement that the model 'achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR' is presented without any task-specific metrics, datasets, or comparison models, yet it is load-bearing for the paper's primary assertion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the performance-related claims lack supporting quantitative evidence and will revise the abstract accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that DeepStack, time markers, and the event-preserving pipeline produce the reported performance gains is unsupported by any quantitative evidence. No benchmarks, baseline comparisons, ablation tables, or error bars are supplied to isolate the contribution of these components versus scale, data volume, or the base LLM, rendering the positioning as a 'promising understanding foundation' unevaluable.
  Authors: We agree with this assessment. The manuscript does not contain ablations, benchmarks, or quantitative comparisons that isolate the contributions of DeepStack, time markers, or the annotation pipeline. We will revise the abstract to describe these design choices and the overall architecture without asserting that they produce specific performance gains relative to scale or other factors. Revision: yes.
- Referee: [Abstract, and implied results sections] The statement that the model 'achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR' is presented without any task-specific metrics, datasets, or comparison models, yet it is load-bearing for the paper's primary assertion.
  Authors: This comment is correct. The abstract currently includes an unsupported claim of strong performance. We will revise the abstract to state that the model supports and has been trained for the listed tasks (general audio understanding, speech captioning, ASR, and timestamped ASR) while removing the assertion of strong performance in the absence of reported metrics or comparisons. Revision: yes.
Circularity Check
No circularity in system description
Full rationale
The paper is a technical report describing an audio-language model architecture, data pipeline, and training stages with no equations, derivations, predictions, or first-principles results. Claims of performance are presented as empirical outcomes rather than derived quantities. No load-bearing steps reduce to inputs by construction, and no self-citation chains or ansatzes are invoked in a manner that creates circularity. The central description remains self-contained against external benchmarks.