MOSS-Audio Technical Report

Chenghao Liu; Chen Yang; Chufan Yu; Donghua Yu; Hanfu Chen; Jie Zhu; Jingqi Chen; Jun Zhan; Kang Yu; Ke Chen

arxiv: 2606.01802 · v3 · pith:QE3OTD3Qnew · submitted 2026-06-01 · 💻 cs.SD · cs.AI

MOSS-Audio Technical Report

Chen Yang , Chufan Yu , Hanfu Chen , Jie Zhu , Jingqi Chen , Ke Chen , Wenxuan Wang , Yang Wang

show 22 more authors

Yaozhou Jiang Yi Jiang Zhengyuan Lin Ziqi Chen Zhaoye Fei Chenghao Liu Donghua Yu Jun Zhan Kang Yu Kexin Huang Liwei Fan Mingshu Chen Qinyuan Cheng Ruixiao Li Shimin Li Songlin Wang Xingjian Zhao Yang Gao Yitian Gong Yiyang Zhang Zhe Xu Xipeng Qiu

This is my paper

Pith reviewed 2026-06-28 13:03 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords audio-language modelspeech understandingaudio captioningtimestamped ASRtemporal groundingmultimodal modelvoice agentsevent annotation

0 comments

The pith

MOSS-Audio couples an audio encoder to a language model with cross-layer injection and explicit time markers to support captioning, transcription, and reasoning over speech, sounds, and music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MOSS-Audio as a single model that processes raw audio into text outputs for multiple tasks. It feeds 12.5 Hz encoder features through a modality adapter into an LLM decoder, then adds two mechanisms to improve temporal and acoustic fidelity. DeepStack pulls features from several encoder layers at once, while time markers are inserted directly into the token stream. The training data comes from an annotation pipeline that splits audio at natural event boundaries and merges speech, music, and general-sound captions. After large-scale pretraining and staged post-training, the model reports strong results on general audio understanding, speech captioning, ASR, and timestamped ASR.

Core claim

MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR by coupling a dedicated audio encoder with a modality adapter and a large language model, incorporating DeepStack cross-layer feature injection and time markers, and using an event-preserving audio annotation pipeline for pretraining and SFT data construction.

What carries the argument

DeepStack cross-layer feature injection together with inserted time markers, which supplies the decoder with acoustic features from multiple encoder depths and explicit temporal position cues.

If this is right

The model supports time-aware question answering and audio-grounded reasoning after multi-stage post-training.
Both 4B and 8B parameter versions are released in Instruct and Thinking configurations.
Intermediate branch-specific captions are retained to build task-oriented supervised fine-tuning data.
Time-aware objectives during pretraining enable temporal grounding in the generated outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Voice-agent systems could treat MOSS-Audio as a reusable audio-understanding base rather than training separate models per task.
The retained branch-specific captions may allow targeted fine-tuning for speech-only or music-only applications without full retraining.
Similar cross-layer injection and marker techniques could be tested on other encoder-decoder pairs beyond the current audio setup.
Scaling the pretraining data volume while keeping the same annotation pipeline would test whether the reported gains persist at larger sizes.

Load-bearing premise

The combination of DeepStack injection, time markers, and the event-preserving annotation pipeline is responsible for the measured performance gains.

What would settle it

An ablation that removes DeepStack, time markers, or the event-boundary segmentation step and shows no drop in scores on the reported audio tasks.

read the original abstract

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MOSS-Audio is a standard audio-LLM with DeepStack injection and time markers, but the report supplies no benchmarks or ablations to support any performance claims.

read the letter

The main takeaway is that this technical report describes MOSS-Audio, which follows the encoder-adapter-LLM pattern common in audio-language models and adds DeepStack cross-layer feature injection plus explicit time markers in the token stream. An event-preserving annotation pipeline that segments at coherent boundaries and produces branch-specific captions for speech, music, and general audio is also detailed.

The paper does a solid job walking through the full system: 12.5 Hz encoder outputs, modality adapter, time-aware pretraining objectives, multi-stage post-training for instruction following and reasoning, and the release of 4B and 8B Instruct and Thinking variants. The data pipeline that retains intermediate branch captions for later SFT is a practical touch that could help others build similar datasets.

The clear weakness is the total lack of evidence. The abstract asserts strong performance on audio understanding, captioning, ASR, and timestamped ASR, yet the text contains no tables, no numbers, no baselines, and no ablations isolating DeepStack, the time markers, or the annotation choices. The stress-test note is accurate on this point; without controlled comparisons it is impossible to tell whether those elements drive any gains or whether the model simply benefits from scale and data volume.

This report is aimed at engineers building voice agents who might want implementation details or a starting checkpoint. A reader focused on new methods or verified improvements will find little of value. It does not merit peer review in its current form because the central claims rest on unevaluated assertions rather than data.

Referee Report

2 major / 0 minor

Summary. The manuscript presents MOSS-Audio, a unified audio-language model coupling a 12.5 Hz audio encoder, modality adapter, and LLM decoder. It highlights two central design choices—DeepStack cross-layer feature injection and explicit time markers—plus an event-preserving annotation pipeline that segments audio at event boundaries and produces branch-specific captions for pretraining and SFT. The model is pretrained with time-aware objectives and post-trained for instruction following; 4B and 8B Instruct and Thinking variants are released. The abstract asserts that the system achieves strong performance on general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a foundation for voice agents.

Significance. If the performance claims and the causal contribution of the three highlighted design choices were substantiated by controlled benchmarks and ablations, the work would supply a concrete, temporally grounded audio-language model that could serve as a reusable backbone for downstream voice agents. The explicit retention of intermediate branch-specific captions for SFT data construction is a practical strength that could be reused by others.

major comments (2)

[Abstract] Abstract: The central claim that DeepStack, time markers, and the event-preserving pipeline produce the reported performance gains is unsupported by any quantitative evidence. No benchmarks, baseline comparisons, ablation tables, or error bars are supplied to isolate the contribution of these components versus scale, data volume, or the base LLM, rendering the positioning as a 'promising understanding foundation' unevaluable.
[Abstract] Abstract (and implied results sections): The statement that the model 'achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR' is presented without any task-specific metrics, datasets, or comparison models, which is load-bearing for the paper's primary assertion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the performance-related claims lack supporting quantitative evidence and will revise the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that DeepStack, time markers, and the event-preserving pipeline produce the reported performance gains is unsupported by any quantitative evidence. No benchmarks, baseline comparisons, ablation tables, or error bars are supplied to isolate the contribution of these components versus scale, data volume, or the base LLM, rendering the positioning as a 'promising understanding foundation' unevaluable.

Authors: We agree with this assessment. The manuscript does not contain ablations, benchmarks, or quantitative comparisons that isolate the contributions of DeepStack, time markers, or the annotation pipeline. We will revise the abstract to describe these design choices and the overall architecture without asserting that they produce specific performance gains relative to scale or other factors. revision: yes
Referee: [Abstract] Abstract (and implied results sections): The statement that the model 'achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR' is presented without any task-specific metrics, datasets, or comparison models, which is load-bearing for the paper's primary assertion.

Authors: This comment is correct. The abstract currently includes an unsupported claim of strong performance. We will revise the abstract to state that the model supports and has been trained for the listed tasks (general audio understanding, speech captioning, ASR, and timestamped ASR) while removing the assertion of strong performance in the absence of reported metrics or comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity in system description

full rationale

The paper is a technical report describing an audio-language model architecture, data pipeline, and training stages with no equations, derivations, predictions, or first-principles results. Claims of performance are presented as empirical outcomes rather than derived quantities. No load-bearing steps reduce to inputs by construction, and no self-citation chains or ansatzes are invoked in a manner that creates circularity. The central description remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5896 in / 1058 out tokens · 24194 ms · 2026-06-28T13:03:26.576590+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 4 canonical work pages

[1]

MMSU: A massive multi-task spoken language understanding and reasoning benchmark

Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark. arXiv preprint arXiv:2506.04779 , 2025

Pith/arXiv arXiv 2025
[2]

Audio set: An ontology and human-labeled dataset for audio events

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages 776–780. IEEE, 2017

2017
[3]

Audiocaps: Generating captions for audios in the wild,

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 119–132. Association for Computational Linguistics, 2019. doi: 10.18653/v1/N19-1011

work page doi:10.18653/v1/n19-1011 2019
[4]

pushing-out

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, T om Ko, Chengqi Zhao, Mark D. Plumbley , Yuexian Zou, and Wenwu Wang. WavCaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM T ransactions on Audio, Speech, and Language Processing, 2024. doi: 10.1109/TASLP. 2024.3419446

work page doi:10.1109/taslp 2024
[5]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey , and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning , pages 28492–28518. PMLR, 2023

2023
[6]

Qwen- Audio: Advancing universal audio understanding via unified large-scale audio-language models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhĳie Yan, Chang Zhou, and Jingren Zhou. Qwen- Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023

Pith/arXiv arXiv 2023
[7]

SALMONN: T owards generic hearing abilities for large language models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: T owards generic hearing abilities for large language models. In The T welfth International Conference on Learning Representations, 2024

2024
[8]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research , pages 25125–25148. PMLR, 2024

2024
[9]

Qwen2-audio technical report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv , Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759, 2024

Pith/arXiv arXiv 2024
[10]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

Pith/arXiv arXiv 2025
[11]

Sakshi, Oriol Nieto, Ra- mani Duraiswami, and Dinesh Manocha

Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh T yagi, S. Sakshi, Oriol Nieto, Ra- mani Duraiswami, and Dinesh Manocha. GAMA: A large audio-language model with advanced audio understand- ing and complex reasoning abilities. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association fo...

2024
[12]

Audio flamingo next: Next- generation open audio-language models for speech, sound, and music

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, and Wei Ping. Audio flamingo next: Next- generation open audio-language models for ...

Pith/arXiv arXiv 2026
[13]

Enhancing temporal understanding in audio question answer- ing for large audio language models

Arvind Krishna Sridhar, Yinyi Guo, and Erik Visser. Enhancing temporal understanding in audio question answer- ing for large audio language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language T echnologies, Industry T rack, pages 1026–1035. Association for Co...

2025
[14]

Sakshi, Utkarsh T yagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha

S. Sakshi, Utkarsh T yagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations , 2025

2025
[15]

MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, et al. MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix. arXiv preprint arXiv:2505.13032, 2025. 19

arXiv 2025
[16]

Liu, Hongyin Luo, Leonid Karlinsky , and James Glass

Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky , and James Glass. Joint audio and speech under- standing. arXiv preprint arXiv:2309.14405, 2023

arXiv 2023
[17]

DeepStack: Deeply stacking visual tokens is surprisingly simple and effective for large multimodal models

Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. DeepStack: Deeply stacking visual tokens is surprisingly simple and effective for large multimodal models. In Advances in Neural Information Processing Systems, 2024

2024
[18]

MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, et al. MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. arXiv preprint arXiv:2508.13992, 2025

arXiv 2025
[19]

w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. Layer-wise analysis of a self-supervised speech representation model. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021 , pages 914–921. IEEE, 2021. doi: 10.1109/ASRU51503.2021.9688093. URL https://doi.org/10.1109/ ASRU51503.2021.9688093

work page doi:10.1109/asru51503.2021.9688093 2021
[20]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Y oshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected T opics in Signal Processing, 16(6):1505–1518, 2022

2022
[21]

Superb: Speech processing universal performance benchmark

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. Superb: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051, 2021

arXiv 2021
[22]

Qwen3-vl technical report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[23]

Moss transcribe diarize technical report

Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, et al. Moss transcribe diarize technical report. arXiv preprint arXiv:2601.01554, 2026

arXiv 2026
[24]

BEATs: Audio pre-training with acoustic tokenizers

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel T ompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. BEATs: Audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research , pages 5178–5193. PMLR, 2023

2023
[25]

Effective pre-training of audio transformers for sound event detection, 2024

Florian Schmid, T obias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, and Gerhard Widmer. Effective pre-training of audio transformers for sound event detection, 2024. URL https://arxiv.org/abs/2409.09546

arXiv 2024
[26]

Qwen3-omni technical report, 2025

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv , Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo...

Pith/arXiv arXiv 2025
[27]

Fun-asr technical report, 2025

Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Ying Liu, Xiang Lv , Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Haoxu Wang, Wen Wang, Wupeng Wang, Yuzhong Wu, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye...

arXiv 2025
[28]

Qwen3-asr technical report

Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-asr technical report. arXiv preprint arXiv:2601.21337, 2026

Pith/arXiv arXiv 2026
[29]

Bag of tricks for efficient text classification

Armand Joulin, Edouard Grave, Piotr Bojanowski, and T omas Mikolov . Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: V olume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017

2017
[30]

Fasttext.zip: Compressing text classification models

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthĳs Douze, Hérve Jégou, and T omas Mikolov . Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016

Pith/arXiv arXiv 2016
[31]

Scaling speech technology to 1,000+ languages

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden T omasello, Arun Babu, Sayani Kundu, Ali Elkahky , Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. arXiv, 2023. 20

2023
[32]

Scaling speech technology to 1,000+ languages

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden T omasello, Arun Babu, Sayani Kundu, Ali Elkahky , Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52,
[33]

URL http://jmlr.org/papers/v25/23-1318.html
[34]

Leveraging self- supervised learning for speaker diarization

Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, and Lukáš Burget. Leveraging self- supervised learning for speaker diarization. In Proc. ICASSP, 2025

2025
[35]

Detect any sound: Open-vocabulary sound event detection with multi-modal queries, 2025

Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, and Ian McLoughlin. Detect any sound: Open-vocabulary sound event detection with multi-modal queries, 2025. URL https://arxiv.org/abs/2507.16343

arXiv 2025
[36]

Timeaudio: Bridging temporal gaps in large audio-language models

Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, and Xiangdong Wang. Timeaudio: Bridging temporal gaps in large audio-language models. arXiv preprint arXiv:2511.11039, 2025

arXiv 2025
[37]

Bryan, Zeyu Jin, and Justin Salamon

Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Dinesh Manocha, Nicholas J. Bryan, Zeyu Jin, and Justin Salamon. Tac: Timestamped audio captioning, 2026. URL https://arxiv.org/ abs/2602.15766

arXiv 2026
[38]

Music flamingo: Scaling music understanding in audio language models, 2025

Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Du- raiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, and Bryan Catanzaro. Music flamingo: Scaling music understanding in audio language models, 2025. URL https://arxiv.org/abs/2511.10289

arXiv 2025
[39]

Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S. Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities. In Proceedings of the 42nd International Conference on Machine Learning , Proceedings of Machine Learning Research. PMLR, 2025

2025
[40]

Approximate note transcription for the improved identification of difficult chords

Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 135–140, Utrecht, The Netherlands, 2010

2010
[41]

Beatnet: Crnn and particle filtering for online joint beat downbeat and meter tracking

Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan. Beatnet: Crnn and particle filtering for online joint beat downbeat and meter tracking. 2021

2021
[42]

madmom: a new python audio and music signal processing library , 2016

Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new python audio and music signal processing library , 2016. URL https://arxiv.org/abs/1605.07008

Pith/arXiv arXiv 2016
[43]

Essentia: an open-source library for sound and music analysis

Dmitry Bogdanov , Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José Zapata, and Xavier Serra. Essentia: an open-source library for sound and music analysis. In Proceedings of the 21st ACM International Conference on Multimedia , MM ’13, page 855 ⚶858, New York, NY , USA,
[44]

ISBN 9781450324045

Association for Computing Machinery. ISBN 9781450324045. doi: 10.1145/2502081.2502229. URL https: //doi.org/10.1145/2502081.2502229

work page doi:10.1145/2502081.2502229
[45]

Codified audio language modeling learns useful representa- tions for music information retrieval

Rodrigo Castellon, Chris Donahue, and Percy Liang. Codified audio language modeling learns useful representa- tions for music information retrieval. In ISMIR, 2021

2021
[46]

Songformer: Scaling music structure analysis with heterogeneous supervision, 2026

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Yanbo Wang, Wei Xue, and Lei Xie. Songformer: Scaling music structure analysis with heterogeneous supervision, 2026. URL https://arxiv.org/abs/2510.02797

Pith/arXiv arXiv 2026
[47]

Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing

Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, T om Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205, 2021

arXiv 2021
[48]

Unified speech-text pre-training for speech translation and recognition

Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Ab- delrahman Mohamed, Michael Auli, et al. Unified speech-text pre-training for speech translation and recognition. arXiv preprint arXiv:2204.05409, 2022

arXiv 2022
[49]

Speechgpt: Empow- ering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empow- ering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000 , 2023. 21

arXiv 2023
[50]

Spirit-lm: Interleaved spoken and written language model

T u Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov , et al. Spirit-lm: Interleaved spoken and written language model. T ransactions of the Association for Computational Linguistics, 13:30–52, 2025

2025
[51]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037 , 2024

Pith/arXiv arXiv 2024
[52]

Mini-omni: Language models can hear, talk while thinking in streaming

Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024

arXiv 2024
[53]

Glm-4-voice: T owards intelligent and human-like end-to-end spoken chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: T owards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612, 2024

Pith/arXiv arXiv 2024
[54]

Baichuan-audio: A unified framework for end-to-end speech interaction

Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, et al. Baichuan-audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239, 2025

arXiv 2025
[55]

Step-audio: Unified understanding and generation in intelligent speech interaction

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

Pith/arXiv arXiv 2025
[56]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov , and Abdelrah- man Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing , 29:3451–3460, 2021

2021
[57]

CLAP: Learning audio concepts from natural language supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: Learning audio concepts from natural language supervision. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

2023
[58]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Marianna Nezhurina, Taylor Berg-Kirkpatrick, and Shlomo Dub- nov . Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing , 2023

2023
[59]

High fidelity neural audio compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

Pith/arXiv arXiv 2022
[60]

High-fidelity audio compression with improved rvqgan

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved rvqgan. Advances in Neural Information Processing Systems, 36:27980–27993, 2023

2023
[61]

Speechtokenizer: Unified speech tokenizer for speech large language models

Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech large language models. arXiv preprint arXiv:2308.16692, 2023

arXiv 2023
[62]

Codec does matter: Exploring the semantic shortcoming of codec for audio language model

Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, et al. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. InProceedings of the AAAI Conference on Artificial Intelligence , volume 39, pages 25697–25705, 2025

2025
[63]

Spotsound: Enhancing large audio- language models with fine-grained temporal grounding

Luoyi Sun, Xiao Zhou, Zeqian Li, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spotsound: Enhancing large audio- language models with fine-grained temporal grounding. arXiv preprint arXiv:2604.13023, 2026

Pith/arXiv arXiv 2026
[64]

Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability

Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024

arXiv 2024
[65]

The interspeech 2026 audio encoder capability challenge for large audio lan- guage models, 2026

Heinrich Dinkel, Jiahao Zhou, Guanbo Wang, Yadong Niu, Junbo Zhang, Yufeng Hao, Ying Liu, Ke Li, Wenwu Wang, Zhiyong Wu, and Jian Luan. The interspeech 2026 audio encoder capability challenge for large audio lan- guage models, 2026. URL https://arxiv.org/abs/2603.22728

arXiv 2026
[66]

Mecat: A multi-experts constructed benchmark for fine-grained audio understanding tasks

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, and Jian Luan. Mecat: A multi-experts constructed benchmark for fine-grained audio understanding tasks. arXiv preprint arXiv:2507.23511, 2025. 22 A Additional Details A.1 Evaluation Prompts Shared Audio-Text Evaluation Template [system] You ...

Pith/arXiv arXiv 2025

[1] [1]

MMSU: A massive multi-task spoken language understanding and reasoning benchmark

Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark. arXiv preprint arXiv:2506.04779 , 2025

Pith/arXiv arXiv 2025

[2] [2]

Audio set: An ontology and human-labeled dataset for audio events

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages 776–780. IEEE, 2017

2017

[3] [3]

Audiocaps: Generating captions for audios in the wild,

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 119–132. Association for Computational Linguistics, 2019. doi: 10.18653/v1/N19-1011

work page doi:10.18653/v1/n19-1011 2019

[4] [4]

pushing-out

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, T om Ko, Chengqi Zhao, Mark D. Plumbley , Yuexian Zou, and Wenwu Wang. WavCaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM T ransactions on Audio, Speech, and Language Processing, 2024. doi: 10.1109/TASLP. 2024.3419446

work page doi:10.1109/taslp 2024

[5] [5]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey , and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning , pages 28492–28518. PMLR, 2023

2023

[6] [6]

Qwen- Audio: Advancing universal audio understanding via unified large-scale audio-language models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhĳie Yan, Chang Zhou, and Jingren Zhou. Qwen- Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023

Pith/arXiv arXiv 2023

[7] [7]

SALMONN: T owards generic hearing abilities for large language models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: T owards generic hearing abilities for large language models. In The T welfth International Conference on Learning Representations, 2024

2024

[8] [8]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research , pages 25125–25148. PMLR, 2024

2024

[9] [9]

Qwen2-audio technical report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv , Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759, 2024

Pith/arXiv arXiv 2024

[10] [10]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

Pith/arXiv arXiv 2025

[11] [11]

Sakshi, Oriol Nieto, Ra- mani Duraiswami, and Dinesh Manocha

Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh T yagi, S. Sakshi, Oriol Nieto, Ra- mani Duraiswami, and Dinesh Manocha. GAMA: A large audio-language model with advanced audio understand- ing and complex reasoning abilities. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association fo...

2024

[12] [12]

Audio flamingo next: Next- generation open audio-language models for speech, sound, and music

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, and Wei Ping. Audio flamingo next: Next- generation open audio-language models for ...

Pith/arXiv arXiv 2026

[13] [13]

Enhancing temporal understanding in audio question answer- ing for large audio language models

Arvind Krishna Sridhar, Yinyi Guo, and Erik Visser. Enhancing temporal understanding in audio question answer- ing for large audio language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language T echnologies, Industry T rack, pages 1026–1035. Association for Co...

2025

[14] [14]

Sakshi, Utkarsh T yagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha

S. Sakshi, Utkarsh T yagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations , 2025

2025

[15] [15]

MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, et al. MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix. arXiv preprint arXiv:2505.13032, 2025. 19

arXiv 2025

[16] [16]

Liu, Hongyin Luo, Leonid Karlinsky , and James Glass

Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky , and James Glass. Joint audio and speech under- standing. arXiv preprint arXiv:2309.14405, 2023

arXiv 2023

[17] [17]

DeepStack: Deeply stacking visual tokens is surprisingly simple and effective for large multimodal models

Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. DeepStack: Deeply stacking visual tokens is surprisingly simple and effective for large multimodal models. In Advances in Neural Information Processing Systems, 2024

2024

[18] [18]

MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, et al. MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. arXiv preprint arXiv:2508.13992, 2025

arXiv 2025

[19] [19]

w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. Layer-wise analysis of a self-supervised speech representation model. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021 , pages 914–921. IEEE, 2021. doi: 10.1109/ASRU51503.2021.9688093. URL https://doi.org/10.1109/ ASRU51503.2021.9688093

work page doi:10.1109/asru51503.2021.9688093 2021

[20] [20]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Y oshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected T opics in Signal Processing, 16(6):1505–1518, 2022

2022

[21] [21]

Superb: Speech processing universal performance benchmark

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. Superb: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051, 2021

arXiv 2021

[22] [22]

Qwen3-vl technical report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[23] [23]

Moss transcribe diarize technical report

Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, et al. Moss transcribe diarize technical report. arXiv preprint arXiv:2601.01554, 2026

arXiv 2026

[24] [24]

BEATs: Audio pre-training with acoustic tokenizers

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel T ompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. BEATs: Audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research , pages 5178–5193. PMLR, 2023

2023

[25] [25]

Effective pre-training of audio transformers for sound event detection, 2024

Florian Schmid, T obias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, and Gerhard Widmer. Effective pre-training of audio transformers for sound event detection, 2024. URL https://arxiv.org/abs/2409.09546

arXiv 2024

[26] [26]

Qwen3-omni technical report, 2025

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv , Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo...

Pith/arXiv arXiv 2025

[27] [27]

Fun-asr technical report, 2025

Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Ying Liu, Xiang Lv , Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Haoxu Wang, Wen Wang, Wupeng Wang, Yuzhong Wu, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye...

arXiv 2025

[28] [28]

Qwen3-asr technical report

Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-asr technical report. arXiv preprint arXiv:2601.21337, 2026

Pith/arXiv arXiv 2026

[29] [29]

Bag of tricks for efficient text classification

Armand Joulin, Edouard Grave, Piotr Bojanowski, and T omas Mikolov . Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: V olume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017

2017

[30] [30]

Fasttext.zip: Compressing text classification models

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthĳs Douze, Hérve Jégou, and T omas Mikolov . Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016

Pith/arXiv arXiv 2016

[31] [31]

Scaling speech technology to 1,000+ languages

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden T omasello, Arun Babu, Sayani Kundu, Ali Elkahky , Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. arXiv, 2023. 20

2023

[32] [32]

Scaling speech technology to 1,000+ languages

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden T omasello, Arun Babu, Sayani Kundu, Ali Elkahky , Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52,

[33] [33]

URL http://jmlr.org/papers/v25/23-1318.html

[34] [34]

Leveraging self- supervised learning for speaker diarization

Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, and Lukáš Burget. Leveraging self- supervised learning for speaker diarization. In Proc. ICASSP, 2025

2025

[35] [35]

Detect any sound: Open-vocabulary sound event detection with multi-modal queries, 2025

Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, and Ian McLoughlin. Detect any sound: Open-vocabulary sound event detection with multi-modal queries, 2025. URL https://arxiv.org/abs/2507.16343

arXiv 2025

[36] [36]

Timeaudio: Bridging temporal gaps in large audio-language models

Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, and Xiangdong Wang. Timeaudio: Bridging temporal gaps in large audio-language models. arXiv preprint arXiv:2511.11039, 2025

arXiv 2025

[37] [37]

Bryan, Zeyu Jin, and Justin Salamon

Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Dinesh Manocha, Nicholas J. Bryan, Zeyu Jin, and Justin Salamon. Tac: Timestamped audio captioning, 2026. URL https://arxiv.org/ abs/2602.15766

arXiv 2026

[38] [38]

Music flamingo: Scaling music understanding in audio language models, 2025

Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Du- raiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, and Bryan Catanzaro. Music flamingo: Scaling music understanding in audio language models, 2025. URL https://arxiv.org/abs/2511.10289

arXiv 2025

[39] [39]

Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S. Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities. In Proceedings of the 42nd International Conference on Machine Learning , Proceedings of Machine Learning Research. PMLR, 2025

2025

[40] [40]

Approximate note transcription for the improved identification of difficult chords

Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 135–140, Utrecht, The Netherlands, 2010

2010

[41] [41]

Beatnet: Crnn and particle filtering for online joint beat downbeat and meter tracking

Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan. Beatnet: Crnn and particle filtering for online joint beat downbeat and meter tracking. 2021

2021

[42] [42]

madmom: a new python audio and music signal processing library , 2016

Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new python audio and music signal processing library , 2016. URL https://arxiv.org/abs/1605.07008

Pith/arXiv arXiv 2016

[43] [43]

Essentia: an open-source library for sound and music analysis

Dmitry Bogdanov , Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José Zapata, and Xavier Serra. Essentia: an open-source library for sound and music analysis. In Proceedings of the 21st ACM International Conference on Multimedia , MM ’13, page 855 ⚶858, New York, NY , USA,

[44] [44]

ISBN 9781450324045

Association for Computing Machinery. ISBN 9781450324045. doi: 10.1145/2502081.2502229. URL https: //doi.org/10.1145/2502081.2502229

work page doi:10.1145/2502081.2502229

[45] [45]

Codified audio language modeling learns useful representa- tions for music information retrieval

Rodrigo Castellon, Chris Donahue, and Percy Liang. Codified audio language modeling learns useful representa- tions for music information retrieval. In ISMIR, 2021

2021

[46] [46]

Songformer: Scaling music structure analysis with heterogeneous supervision, 2026

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Yanbo Wang, Wei Xue, and Lei Xie. Songformer: Scaling music structure analysis with heterogeneous supervision, 2026. URL https://arxiv.org/abs/2510.02797

Pith/arXiv arXiv 2026

[47] [47]

Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing

Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, T om Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint arXiv:2110.07205, 2021

arXiv 2021

[48] [48]

Unified speech-text pre-training for speech translation and recognition

Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Ab- delrahman Mohamed, Michael Auli, et al. Unified speech-text pre-training for speech translation and recognition. arXiv preprint arXiv:2204.05409, 2022

arXiv 2022

[49] [49]

Speechgpt: Empow- ering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empow- ering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000 , 2023. 21

arXiv 2023

[50] [50]

Spirit-lm: Interleaved spoken and written language model

T u Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov , et al. Spirit-lm: Interleaved spoken and written language model. T ransactions of the Association for Computational Linguistics, 13:30–52, 2025

2025

[51] [51]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037 , 2024

Pith/arXiv arXiv 2024

[52] [52]

Mini-omni: Language models can hear, talk while thinking in streaming

Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024

arXiv 2024

[53] [53]

Glm-4-voice: T owards intelligent and human-like end-to-end spoken chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: T owards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612, 2024

Pith/arXiv arXiv 2024

[54] [54]

Baichuan-audio: A unified framework for end-to-end speech interaction

Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, et al. Baichuan-audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239, 2025

arXiv 2025

[55] [55]

Step-audio: Unified understanding and generation in intelligent speech interaction

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

Pith/arXiv arXiv 2025

[56] [56]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov , and Abdelrah- man Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing , 29:3451–3460, 2021

2021

[57] [57]

CLAP: Learning audio concepts from natural language supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: Learning audio concepts from natural language supervision. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

2023

[58] [58]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Marianna Nezhurina, Taylor Berg-Kirkpatrick, and Shlomo Dub- nov . Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing , 2023

2023

[59] [59]

High fidelity neural audio compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

Pith/arXiv arXiv 2022

[60] [60]

High-fidelity audio compression with improved rvqgan

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved rvqgan. Advances in Neural Information Processing Systems, 36:27980–27993, 2023

2023

[61] [61]

Speechtokenizer: Unified speech tokenizer for speech large language models

Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech large language models. arXiv preprint arXiv:2308.16692, 2023

arXiv 2023

[62] [62]

Codec does matter: Exploring the semantic shortcoming of codec for audio language model

Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, et al. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. InProceedings of the AAAI Conference on Artificial Intelligence , volume 39, pages 25697–25705, 2025

2025

[63] [63]

Spotsound: Enhancing large audio- language models with fine-grained temporal grounding

Luoyi Sun, Xiao Zhou, Zeqian Li, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spotsound: Enhancing large audio- language models with fine-grained temporal grounding. arXiv preprint arXiv:2604.13023, 2026

Pith/arXiv arXiv 2026

[64] [64]

Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability

Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024

arXiv 2024

[65] [65]

The interspeech 2026 audio encoder capability challenge for large audio lan- guage models, 2026

Heinrich Dinkel, Jiahao Zhou, Guanbo Wang, Yadong Niu, Junbo Zhang, Yufeng Hao, Ying Liu, Ke Li, Wenwu Wang, Zhiyong Wu, and Jian Luan. The interspeech 2026 audio encoder capability challenge for large audio lan- guage models, 2026. URL https://arxiv.org/abs/2603.22728

arXiv 2026

[66] [66]

Mecat: A multi-experts constructed benchmark for fine-grained audio understanding tasks

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, and Jian Luan. Mecat: A multi-experts constructed benchmark for fine-grained audio understanding tasks. arXiv preprint arXiv:2507.23511, 2025. 22 A Additional Details A.1 Evaluation Prompts Shared Audio-Text Evaluation Template [system] You ...

Pith/arXiv arXiv 2025