
arxiv: 2606.01802 · v3 · pith:QE3OTD3Qnew · submitted 2026-06-01 · 💻 cs.SD · cs.AI

MOSS-Audio Technical Report

Pith reviewed 2026-06-28 13:03 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords audio-language model · speech understanding · audio captioning · timestamped ASR · temporal grounding · multimodal model · voice agents · event annotation

The pith

MOSS-Audio couples an audio encoder to a language model with cross-layer injection and explicit time markers to support captioning, transcription, and reasoning over speech, sounds, and music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MOSS-Audio as a single model that turns raw audio into text outputs for multiple tasks. An audio encoder produces features at 12.5 Hz, a modality adapter projects them into the decoder space, and an LLM decoder generates text. Two mechanisms are added to improve temporal and acoustic fidelity: DeepStack injects features from several encoder layers at once, and time markers are inserted directly into the audio-token stream. The training data comes from an annotation pipeline that splits audio at natural event boundaries and merges speech, music, and general-sound captions. After large-scale pretraining and staged post-training, the authors report strong results on general audio understanding, speech captioning, ASR, and timestamped ASR.
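To make the time-marker idea concrete, here is a minimal sketch in Python. Only the 12.5 Hz frame rate (80 ms per frame) comes from the paper. The marker interval `marker_every_s`, the `<t=...s>` marker format, and the function name are hypothetical choices for illustration, not the authors' implementation.

```python
FRAME_RATE_HZ = 12.5  # encoder frame rate stated in the paper (80 ms per frame)


def insert_time_markers(audio_tokens, marker_every_s=2.0):
    """Interleave timestamp markers with audio frame tokens.

    Assumption: the marker spacing and the "<t=...s>" text format are
    illustrative; the paper only says markers are inserted into the stream.
    """
    frames_per_marker = int(FRAME_RATE_HZ * marker_every_s)
    out = []
    for i, tok in enumerate(audio_tokens):
        if i % frames_per_marker == 0:
            # Frame index divided by frame rate gives the elapsed time in seconds.
            out.append(f"<t={i / FRAME_RATE_HZ:.1f}s>")
        out.append(tok)
    return out


frames = [f"a{i}" for i in range(60)]  # 60 frames = 4.8 s of audio
stream = insert_time_markers(frames)
```

With a 2-second interval, a marker lands before every 25th frame, so the decoder sees explicit positions at 0.0 s, 2.0 s, and 4.0 s in this example.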

Core claim

MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR by coupling a dedicated audio encoder with a modality adapter and a large language model, incorporating DeepStack cross-layer feature injection and time markers, and using an event-preserving audio annotation pipeline for pretraining and SFT data construction.

What carries the argument

DeepStack cross-layer feature injection together with inserted time markers, which supplies the decoder with acoustic features from multiple encoder depths and explicit temporal position cues.
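One way to picture cross-layer injection is as a residual addition of features from different encoder depths into the audio positions of successive early decoder layers. The sketch below uses plain Python lists and assumes exactly that scheme. The `project` stand-in (a simple scaling), the `scales` parameter, and the one-tap-per-decoder-layer mapping are all hypothetical and may differ from what MOSS-Audio actually does.

```python
def project(vec, scale):
    """Stand-in for a learned adapter projection (here: simple scaling)."""
    return [scale * x for x in vec]


def deepstack_inject(hidden, encoder_taps, audio_positions, scales):
    """Return the input to each early decoder layer after injection.

    hidden:          per-token vectors entering the decoder
    encoder_taps:    one list of per-frame vectors per tapped encoder depth
    audio_positions: indices in `hidden` that hold audio tokens
    scales:          hypothetical per-tap projection scales
    """
    states = [list(h) for h in hidden]  # copy so the caller's data is untouched
    per_layer_inputs = []
    for tap, scale in zip(encoder_taps, scales):
        for pos, feat in zip(audio_positions, tap):
            proj = project(feat, scale)
            states[pos] = [s + p for s, p in zip(states[pos], proj)]
        # A real decoder layer would transform `states` here; omitted in this sketch.
        per_layer_inputs.append([list(s) for s in states])
    return per_layer_inputs


hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # one text token, two audio tokens
taps = [
    [[1.0, 0.0], [0.0, 1.0]],    # features from a shallow encoder layer
    [[10.0, 0.0], [0.0, 10.0]],  # features from a deeper encoder layer
]
layer_inputs = deepstack_inject(hidden, taps, audio_positions=[1, 2], scales=[1.0, 0.5])
```

Text positions are left unchanged, while each audio position accumulates information from one additional encoder depth per decoder layer.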

If this is right

  • The model supports time-aware question answering and audio-grounded reasoning after multi-stage post-training.
  • Both 4B and 8B parameter versions are released in Instruct and Thinking configurations.
  • Intermediate branch-specific captions are retained to build task-oriented supervised fine-tuning data.
  • Time-aware objectives during pretraining enable temporal grounding in the generated outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice-agent systems could treat MOSS-Audio as a reusable audio-understanding base rather than training separate models per task.
  • The retained branch-specific captions may allow targeted fine-tuning for speech-only or music-only applications without full retraining.
  • Similar cross-layer injection and marker techniques could be tested on other encoder-decoder pairs beyond the current audio setup.
  • Scaling the pretraining data volume while keeping the same annotation pipeline would test whether the reported gains persist at larger sizes.

Load-bearing premise

The combination of DeepStack injection, time markers, and the event-preserving annotation pipeline is responsible for the measured performance gains.

What would settle it

An ablation that removes DeepStack, time markers, or the event-boundary segmentation step and shows no drop in scores on the reported audio tasks.
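Such a leave-one-out ablation could be organized as in the following sketch. The component names and helper functions are hypothetical labels for this illustration, and no scores are supplied here because the paper's abstract reports none.

```python
COMPONENTS = ("deepstack", "time_markers", "event_segmentation")


def ablation_configs(components=COMPONENTS):
    """Build the full system plus one variant per removed component."""
    full = {c: True for c in components}
    configs = [("full", full)]
    for c in components:
        variant = dict(full)
        variant[c] = False  # switch off exactly one component
        configs.append((f"no_{c}", variant))
    return configs


def score_drops(scores):
    """Given {config_name: metric} (higher is better), return the drop vs. 'full'."""
    return {name: scores["full"] - s for name, s in scores.items() if name != "full"}
```

A drop near zero for a variant would suggest the removed component is not responsible for the gains on that metric.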

Original abstract

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
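The annotation pipeline described in the abstract (segment at event boundaries, annotate per branch, merge, keep the branch captions) can be sketched roughly as follows. The `Segment` class, the function names, and the string-concatenation merge are assumptions made for illustration. The boundaries are taken as given from some upstream detector, and the paper's actual merging step is not specified in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_s: float
    end_s: float
    # Branch-specific captions, e.g. {"speech": "...", "music": "..."};
    # retained so they can later seed task-oriented SFT data.
    branch_captions: dict = field(default_factory=dict)


def segment_at_boundaries(duration_s, boundaries_s):
    """Cut [0, duration_s] at the given event boundaries (assumed precomputed)."""
    inner = sorted(b for b in boundaries_s if 0.0 < b < duration_s)
    cuts = [0.0] + inner + [duration_s]
    return [Segment(a, b) for a, b in zip(cuts, cuts[1:])]


def merge_captions(seg, order=("speech", "music", "general")):
    """Merge branch captions into one unified caption (plain concatenation here)."""
    parts = [f"{b}: {seg.branch_captions[b]}" for b in order if b in seg.branch_captions]
    return f"[{seg.start_s:.1f}-{seg.end_s:.1f}s] " + "; ".join(parts)


segments = segment_at_boundaries(10.0, [7.5, 4.0])
segments[0].branch_captions = {"general": "door closes", "speech": "a greeting"}
unified = merge_captions(segments[0])
```

Keeping `branch_captions` on each segment mirrors the paper's point that the intermediate captions stay available after the unified caption is built.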

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents MOSS-Audio, a unified audio-language model coupling a 12.5 Hz audio encoder, modality adapter, and LLM decoder. It highlights two central design choices—DeepStack cross-layer feature injection and explicit time markers—plus an event-preserving annotation pipeline that segments audio at event boundaries and produces branch-specific captions for pretraining and SFT. The model is pretrained with time-aware objectives and post-trained for instruction following; 4B and 8B Instruct and Thinking variants are released. The abstract asserts that the system achieves strong performance on general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a foundation for voice agents.

Significance. If the performance claims and the causal contribution of the three highlighted design choices were substantiated by controlled benchmarks and ablations, the work would supply a concrete, temporally grounded audio-language model that could serve as a reusable backbone for downstream voice agents. The explicit retention of intermediate branch-specific captions for SFT data construction is a practical strength that could be reused by others.

major comments (2)
  1. [Abstract] Abstract: The central claim that DeepStack, time markers, and the event-preserving pipeline produce the reported performance gains is unsupported by any quantitative evidence. No benchmarks, baseline comparisons, ablation tables, or error bars are supplied to isolate the contribution of these components versus scale, data volume, or the base LLM, rendering the positioning as a 'promising understanding foundation' unevaluable.
  2. [Abstract] Abstract (and implied results sections): The statement that the model 'achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR' is presented without any task-specific metrics, datasets, or comparison models, which is load-bearing for the paper's primary assertion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the performance-related claims lack supporting quantitative evidence and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that DeepStack, time markers, and the event-preserving pipeline produce the reported performance gains is unsupported by any quantitative evidence. No benchmarks, baseline comparisons, ablation tables, or error bars are supplied to isolate the contribution of these components versus scale, data volume, or the base LLM, rendering the positioning as a 'promising understanding foundation' unevaluable.

    Authors: We agree with this assessment. The manuscript does not contain ablations, benchmarks, or quantitative comparisons that isolate the contributions of DeepStack, time markers, or the annotation pipeline. We will revise the abstract to describe these design choices and the overall architecture without asserting that they produce specific performance gains relative to scale or other factors.

    revision: yes

  2. Referee: [Abstract] Abstract (and implied results sections): The statement that the model 'achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR' is presented without any task-specific metrics, datasets, or comparison models, which is load-bearing for the paper's primary assertion.

    Authors: This comment is correct. The abstract currently includes an unsupported claim of strong performance. We will revise the abstract to state that the model supports and has been trained for the listed tasks (general audio understanding, speech captioning, ASR, and timestamped ASR) while removing the assertion of strong performance in the absence of reported metrics or comparisons.

    revision: yes

Circularity Check

0 steps flagged

No circularity in system description

Full rationale

The paper is a technical report describing an audio-language model architecture, data pipeline, and training stages with no equations, derivations, predictions, or first-principles results. Claims of performance are presented as empirical outcomes rather than derived quantities. No load-bearing steps reduce to inputs by construction, and no self-citation chains or ansatzes are invoked in a manner that creates circularity. The central description remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5896 in / 1058 out tokens · 24194 ms · 2026-06-28T13:03:26.576590+00:00 · methodology

